A Supervision Manager has asked you to help allocate scarce resources and identify which firms their team should focus on. Supervisory resources may be allocated according to the following characteristics:
The aim is to achieve an optimal allocation of supervisory resources by steering supervisors in the direction of important or ‘risky’/’outlying’ firms, so as to facilitate the best use of their time and reduce manual effort as much as possible.
An optimal allocation of supervisory resources is a key task for a regulatory authority such as the Prudential Regulation Authority (PRA). The data provided is rich in financial metrics for insurance firms, such as Gross and Net Written Premium, total assets and liabilities, and various financial ratios. It also has a time series component, which has been taken into account throughout the analysis. With this aim in mind, I have evaluated several dimensions, including each firm's size, its dynamic business profile over time and its anomalies from the norm, using multiple approaches including exploratory data analysis and machine learning algorithms.
The data (325 firms in dataset 1 and 456 firms in dataset 2) is provided in an Excel (.xlsx) file in the wide format and spans a period of 5 years from 2016 to 2020. It is converted to the long format for ease of manipulation and further engineering, and certain processing steps are applied: the year is extracted from a string variable, column names are cleaned to remove spaces, and an inner join is performed to create a third data frame with variables from both datasets for the 325 common firms (the assumption here is that Firm 1 in dataset 1 is the same as Firm 1 in dataset 2, and so on). A fourth data frame is created with data for the 131 firms that are present in dataset 2 but not in dataset 1. The data frames are explored using methods like info(), describe() and dtypes to obtain a basic overview and summary statistics. All variables in the data are numeric except the firm identifier (e.g. 'Firm 1'), which is later set as the index before running any algorithms. The data frames are checked for missing values and have a 100% fill rate for every variable.
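The preprocessing steps above can be sketched as follows. This is a minimal illustration on toy data: the names `df1_wide`, `df2_wide` and the `to_long` helper are hypothetical stand-ins rather than the report's actual code, and the real frames are read from the .xlsx file.

```python
import pandas as pd

# Toy stand-ins for the two wide-format datasets (one column per metric-year)
df1_wide = pd.DataFrame({'Firm': ['Firm 1', 'Firm 2'],
                         'GWP(£m) 2016YE': [10.0, 20.0],
                         'GWP(£m) 2017YE': [12.0, 18.0]})
df2_wide = pd.DataFrame({'Firm': ['Firm 1', 'Firm 2', 'Firm 3'],
                         'Grosscombinedratio 2016YE': [0.9, 1.1, 0.5],
                         'Grosscombinedratio 2017YE': [0.8, 1.2, 0.6]})

def to_long(df_wide):
    """Wide -> long: one row per firm-year, year extracted from the column label."""
    long = df_wide.melt(id_vars='Firm', var_name='col', value_name='value')
    long['Year'] = long['col'].str.extract(r'(\d{4})YE').astype(int)
    # Clean the metric name: strip the year suffix and remove spaces
    long['Metric'] = (long['col'].str.replace(r'\d{4}YE', '', regex=True)
                                 .str.replace(' ', '', regex=False))
    return (long.pivot_table(index=['Firm', 'Year'], columns='Metric',
                             values='value').reset_index())

df1_long, df2_long = to_long(df1_wide), to_long(df2_wide)

# Inner join on the common firms (assumes 'Firm 1' means the same firm in both)
df3_long = df1_long.merge(df2_long, on=['Firm', 'Year'], how='inner')

# Firms present only in dataset 2 (an anti-join)
only_in_2 = df2_long[~df2_long['Firm'].isin(df1_long['Firm'])]
```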
I have conducted an Exploratory Data Analysis (EDA) through various types of data visualisation. My aim in this section was to get a better understanding of the data and create visualisations that focus on the priorities of supervisory resource allocation. I wanted to come up with certain ‘gating rules’ to weed out or identify the important or obviously ‘risky’ firms from the population. Such rules can be used by supervisors while examining dashboards/reports to identify the obvious ‘target’ firms, before delving deeper into the results of more sophisticated machine learning algorithms. Given that we primarily have numeric data for the 325 firms, I started with the pairplot, which helps us visualise pairwise relationships between variables in our dataset. The pairplot indicates that several variables are highly correlated with each other. Additionally, it is quite clearly visible that there are several outliers in the data, indicated by isolated data points on the majority of the plots. However, due to the large number of plots, it is difficult to draw further conclusions from the pairplot, and hence I have analysed separate bivariate scatterplots in later sections of this report. I have then visualised the strength of the relationships between variables (correlations) using a heatmap. The heatmap depicts direct (positive) correlations between certain variables, such as Totalassets(£m) with Totalliabilities(£m), GWP(£m) with NWP(£m), Grossclaimsincurred(£m) with GrossBEL, and EoFforSCR(£m) with Totalassets(£m), Totalliabilities(£m), SCR(£m), NWP(£m), GWP(£m) and Excessofassetsoverliabilities(£m)(=equity), to name a few. We also see that the Netexpenseratio, Grosscombinedratio, Grossexpenseratio, Puregrossclaimsratio and Purenetclaimsratio are perfectly correlated with each other.
sns.pairplot(df3_long.drop(['NWP_abs_change_YoY', 'NWP_perc_change_YoY',
                            'SCRCovRatio_abs_change_YoY', 'SCRCovRatio_perc_change_YoY',
                            'Grossclaimsincur_abs_change_YoY', 'Grossclaimsincur_perc_change_YoY',
                            'Netcombratio_abs_change_YoY', 'Netcombratio_perc_change_YoY'],
                           axis=1), hue='Year')
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(df3_long.corr(),annot=True,cmap="YlGnBu")
A deeper view is achieved through separate bivariate plots. Keeping in mind our first objective of giving more attention to the biggest firms, I have examined indicators of a firm’s size such as its Gross Written Premium, Net Written Premium, Total Assets and Total Liabilities. The correlations from the heatmap support the general understanding that these variables are related to each other. Using plotly and cufflinks, I have created an interactive scatter plot of total assets (£m) vs. total liabilities (£m) and concluded that firms with total assets or liabilities (in £m) of more than 100,000 should be investigated. Based on the availability of resources, this threshold can be tweaked to supervise a larger or smaller number of firms in the corner buckets of these thresholds. Based on this cutoff, supervisors should pay keen attention to 16 firms: Firm 10, Firm 101, Firm 105, Firm 151, Firm 199, Firm 210, Firm 216, Firm 247, Firm 26, Firm 298, Firm 311, Firm 34, Firm 4, Firm 6, Firm 7 and Firm 73. Another indicator of a firm’s size could be its turnover, i.e. the Gross Written Premium and, derived from it, the Net Written Premium (in £m). From the interactive plot of GWP (£m) vs. NWP (£m), we see that supervisors should focus on firms with GWP (£m) of more than 10,000 or NWP (£m) of more than 5,000. The 25 firms that satisfy these criteria are Firm 10, Firm 101, Firm 104, Firm 105, Firm 120, Firm 131, Firm 151, Firm 17, Firm 199, Firm 210, Firm 23, Firm 234, Firm 247, Firm 25, Firm 26, Firm 276, Firm 30, Firm 311, Firm 34, Firm 38, Firm 4, Firm 51, Firm 6, Firm 7 and Firm 73. Between the 16 firms flagged using total assets/liabilities and the 25 flagged using GWP/NWP, there is a large overlap of 14 firms flagged by both criteria. Another straightforward way to derive insights based on firm size is to identify the top 'X' (10, 20, 30 and so on) biggest firms for investigation.
'X' can be adjusted based on resource availability. In the year 2020, the top 10 firms based on total revenue written by the insurer (GWP(£m)) were Firm 210, Firm 4, Firm 311, Firm 34, Firm 7, Firm 151, Firm 26, Firm 247, Firm 25 and Firm 73.
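The gating rules above amount to simple boolean filters. A sketch on toy rows (firm names and values here are illustrative, not from the dataset); the thresholds are the ones quoted in the text and can be tuned to available resources:

```python
import pandas as pd

# Toy stand-in for the joined long-format frame (values are illustrative)
df3_long = pd.DataFrame({
    'Firm': ['Firm A', 'Firm B', 'Firm C'],
    'Totalassets(£m)': [250000.0, 5000.0, 80.0],
    'Totalliabilities(£m)': [240000.0, 4000.0, 60.0],
    'GWP(£m)': [30000.0, 12000.0, 9.0],
    'NWP(£m)': [20000.0, 4000.0, 7.0],
})

# Gating rule 1: balance-sheet size over 100,000 (£m)
size_flag = ((df3_long['Totalassets(£m)'] > 100_000) |
             (df3_long['Totalliabilities(£m)'] > 100_000))
# Gating rule 2: GWP over 10,000 or NWP over 5,000 (£m)
premium_flag = (df3_long['GWP(£m)'] > 10_000) | (df3_long['NWP(£m)'] > 5_000)

flagged_by_size = sorted(df3_long.loc[size_flag, 'Firm'].unique())
flagged_by_premium = sorted(df3_long.loc[premium_flag, 'Firm'].unique())
overlap = sorted(set(flagged_by_size) & set(flagged_by_premium))
```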
Kindly Note: If the .ipynb file is being referred to, the interactive charts may require the code in the appendix to be run. They should, however, be visible in the HTML file.
df3_long.iplot(kind='scatter',x='Totalassets(£m)',y='Totalliabilities(£m)',mode='markers',title='Total assets (£m) vs. Total liabilities (£m)',
xTitle='Totalassets(£m)',yTitle='Totalliabilities(£m)')
df3_long.iplot(kind='scatter',x='GWP(£m)',y='NWP(£m)',mode='markers',color='purple'
,title='GWP(£m) vs. NWP(£m)',xTitle='Gross Written Premium (£m)',yTitle='Net Written Premium(£m)')
df3_long[df3_long['Year']==2020].sort_values('GWP(£m)',ascending=False)[['Firm','Year','GWP(£m)']].head(10).iplot(kind='scatter',x='Firm',y='GWP(£m)',mode='markers',color='blue', title='Top 10 Largest Firms Based on Total Revenue (GWP(£m)) in the Year 2020',xTitle='Firm',yTitle='GWP(£m)')
In order to derive further insights from the data, I have engineered several features such as the size of buffer (the surplus over 100% of the SCR coverage ratio), whether or not the firm is profitable based on its net combined ratio, and the absolute and percentage change year-on-year for several key metrics such as Net Written Premium, Gross claims incurred, etc. On analysing a plot of the size of buffer, I see clear anomalies which look more like cases of erroneous reporting of the SCR coverage ratio; this is explored in a later section of the report. To evaluate our second priority on the changing business profiles of firms year-on-year, I wanted to identify firms whose data are changing substantially, and I have leveraged the derived year-on-year change variables for this. I have selected variables based on business knowledge, favouring those that are correlated with several others in the dataset so that they capture the main trends in the data. I have constructed a boxplot for each of them to see their distributions and the presence of outliers. The boxplot of the percentage change year-on-year in the SCR coverage ratio further supports our hypothesis of erroneous values in the SCR coverage ratio.
df3_long.iplot(kind='scatter',x='Firm',y='Size_of_Buffer',mode='markers',title='Firm-wise Size of Buffer',
xTitle='Firm',yTitle='Size of Buffer',color='pink')
df3_long[['NWP_abs_change_YoY', 'NWP_perc_change_YoY',
'SCRCovRatio_abs_change_YoY', 'SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY', 'Grossclaimsincur_perc_change_YoY',
'Netcombratio_abs_change_YoY', 'Netcombratio_perc_change_YoY']].iplot(kind='box', title='Boxplot of Change Metrics Y-o-Y',
xTitle='Metric',yTitle='Values')
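The engineered year-on-year change features can be derived roughly as below, sketched here using Firm 104's NWP figures quoted later in the report. The `Size_of_Buffer` definition shown in a comment is my reading of "surplus over 100% of the SCR coverage ratio", not code from the report.

```python
import pandas as pd

# Toy frame with Firm 104's NWP values; the groupby ensures YoY changes are
# computed within each firm rather than across firm boundaries.
df = pd.DataFrame({'Firm': ['Firm 104'] * 3,
                   'Year': [2017, 2018, 2019],
                   'NWP(£m)': [63.76, 18870.46, 1697.62]}).sort_values(['Firm', 'Year'])

grp = df.groupby('Firm')['NWP(£m)']
df['NWP_abs_change_YoY'] = grp.diff()
df['NWP_perc_change_YoY'] = grp.pct_change() * 100

# Assumed definition: the buffer is the surplus over 100% of the SCR coverage ratio
# df['Size_of_Buffer'] = df['SCRcoverageratio'] - 1
```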
I have primarily used two methods to identify dynamic firms, i.e. those whose data are changing significantly: the z-score and the interquartile range. Through the z-score approach, 13 firms (Firm 1, Firm 104, Firm 105, Firm 131, Firm 151, Firm 199, Firm 210, Firm 247, Firm 26, Firm 276, Firm 311, Firm 4, Firm 7) are flagged as having an absolute year-on-year change in their Net Written Premium that is greater than 2.5 standard deviations from the mean. I validated the inferences from the z-score analysis with a sample study of the flagged Firm 104. It shows an unusual spike in NWP(£m) from 63.76 in 2017 to 18,870.46 in 2018, followed by a fall back to 1,697.62 in 2019, indicating drastic (rather than gradual) year-on-year changes for this firm that must be investigated. I have used a similar z-score analysis to identify 18 firms (Firm 105, Firm 112, Firm 158, Firm 17, Firm 200, Firm 216, Firm 22, Firm 234, Firm 25, Firm 261, Firm 275, Firm 283, Firm 286, Firm 37, Firm 49, Firm 52, Firm 74, Firm 96) changing by large amounts in their gross claims incurred year-on-year. An insight from the raw data for the highlighted Firm 216 is an unusual dip in Gross claims incurred (£m) from 6,844.01 in 2018 to 0.00 in 2019 and 2020. The same analysis can be applied to several other key indicator variables to see which firms are changing year-on-year. An interesting aspect to highlight is that in a similar z-score analysis on the percentage change in gross claims incurred, a large percentage change may still correspond to a small absolute change in value (for example, from 0.00 in 2019 to 21.45 in 2020 for Firm 29). This may not warrant much supervisory attention because of the minor volume. Therefore, we need to take the volume into account as well when looking at percentages, i.e. we must look at variables in conjunction with each other.
The z-score analysis with a threshold of 2.5 is therefore a good indicator of drastically changing firms. The interquartile range approach, by contrast, flagged a large number (~50%) of firms as requiring attention based on the absolute change in NWP, and hence the z-score analysis is the more suitable approach for our business case. Following the firm-wise view, I looked at an aggregate-level chart of gross claims incurred over time for all firms, and derived the further insight that gross claims incurred are reducing from 2018 to 2020. I validated this conclusion using measures of central tendency such as the mean and median.
df3_long[df3_long['Firm']=='Firm 104'][['Firm','Year','NWP(£m)','NWP_abs_change_YoY']]
df3_long[df3_long['Firm']=='Firm 216'][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_abs_change_YoY']]
df3_long[df3_long['Firm']=='Firm 29'][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_perc_change_YoY']]
fig = px.line(df3_long,x='Firm',y='Grossclaimsincurred(£m)',color='Year')
fig.show()
df3_long.groupby('Year')['Grossclaimsincurred(£m)'].mean()
df3_long.groupby('Year')['Grossclaimsincurred(£m)'].median()
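The z-score flag described above can be sketched as follows, on a toy series of year-on-year changes (firm labels and values are illustrative; the 2.5 threshold matches the text):

```python
import pandas as pd

# z-score flagging: standardise the YoY changes and flag firms more than
# 2.5 standard deviations from the mean. One firm has an extreme change.
changes = pd.Series([1.0, -2.0, 0.5, 1.5, -1.0, 2.0, -0.5, 0.0, 1.0, 300.0],
                    index=[f'Firm {i}' for i in range(1, 11)])

z = (changes - changes.mean()) / changes.std()
flagged = sorted(z[z.abs() > 2.5].index)
```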
The third priority for supervisory resource allocation was to identify outliers from the norm, i.e. for a single reporting period, the firms that deviate significantly from the average. As the data visualisations created above indicated the presence of pronounced outliers in our dataset, I decided to further validate this using machine learning techniques. I have also created bivariate scatter plots for a single reporting period to validate the conclusions drawn from our graphical analyses of total assets (£m) vs. total liabilities (£m) and GWP(£m) vs. NWP(£m), and similar trends are visible here. For machine learning, I have chosen the year 2020 as the period for analysis because, through prior discussions and general understanding, I believe supervisors tend to focus their attention on the most recent data for firms, and 2020 is the latest year in our dataset. However, the same methods can be replicated on the previous years' data as well. Before we apply machine learning techniques to the data, we need to perform certain preprocessing steps, as the variables have different scales, with certain features in millions while others are ratios. This can be achieved through scaling. I have tested the RobustScaler(), StandardScaler() and MinMaxScaler() for this, and proceeded with the MinMaxScaler() as it had a superior performance on our data. The underlying reason could be that the robust scaler uses the interquartile range so that it is robust to outliers; hence the scaled data for outliers and the ‘normal’ data points are not very different, making it harder for the clustering algorithm to separate them into a different cluster. The MinMaxScaler(), however, uses the maximum and minimum values of a variable, so the outliers correspond to the maximum and minimum scaled values while the ‘normal’ data points are squished into a narrower range, making it easier for the clustering algorithm to differentiate between them.
df3_long[df3_long['Year']==2016].iplot(kind='scatter',x='Totalassets(£m)',y='Totalliabilities(£m)',mode='markers',color='red'
,title='Total assets (£m) vs. Total liabilities (£m) for the Year 2016',xTitle='Totalassets(£m)',yTitle='Totalliabilities(£m)')
df3_long[df3_long['Year']==2020].iplot(kind='scatter',x='GWP(£m)',y='NWP(£m)',mode='markers',color='green'
,title='GWP(£m) vs. NWP(£m) for the Year 2020',xTitle='Gross Written Premium (£m)',yTitle='Net Written Premium(£m)')
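The min-max scaling behaviour described above can be illustrated on a toy column with one extreme value: the 'normal' observations are squeezed into a narrow range near 0 while the outlier maps to the maximum scaled value of 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One extreme value dominates the range, so (x - min) / (max - min) pushes
# the ordinary points close to 0 and leaves the outlier at 1.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
scaled = MinMaxScaler().fit_transform(x)
```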
Our data is unlabelled and hence requires unsupervised learning. I decided to apply two machine learning techniques to identify the anomalies: clustering and isolation forest. With respect to clustering, I wanted to analyse whether the algorithm is able to weed out anomalous observations into a few smaller clusters; these are the firms that may need to be investigated by supervisors. Even though clustering doesn't rely on linear assumptions, so collinearity wouldn't cause major issues, it might still lead to redundant features, large computational requirements and the use of more data than required to reach the same patterns. Additionally, as the data held by the PRA for supervision can be large, we need to be conscious of ensuring an optimum utilisation of computational resources. Hence, based on business knowledge and high correlations, I dropped a few variables from the dataset. The final variables retained for machine learning were GWP(£m), NWP(£m), SCR(£m), SCRcoverageratio, Totalassets(£m), Totalliabilities(£m), GrossBEL(inc.TPsaswhole,pre-TMTP)(£m), Grossclaimsincurred(£m), NetBEL(inc.TPsasawhole,pre-TMTP)(£m), Netcombinedratio, Puregrossclaimsratio, EoFforSCR(£m) and Size_of_Buffer.
The first approach I explored is k-means clustering, a method of partitional clustering. It is fast and computationally efficient, and provides a good initial view of the clusters in our data. An important decision in k-means is the choice of the number of clusters (k) beforehand, which I made using the elbow plot and the silhouette coefficient. I tried clustering with k as 3, 4 and 6. In my opinion, the best choice of k for this dataset is 3, as the inertia is low and the number of iterations to converge is half of that at 6 clusters. Additionally, the outlying firms sit in two distinct clusters out of the 3 (as seen through the firm counts and in the plots). Since our data is multivariate, i.e. it has many dimensions, the best way to plot the clusters is to reduce it to two dimensions using Principal Component Analysis (PCA) and then plot it. The k-means algorithm flagged 24 firms (out of the 325 input firms) as being grouped in the two ‘anomalous’ clusters.
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 101,
}
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df3_long_2020_scaled2)
    sse.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df3_long_2020_scaled2)
    score = silhouette_score(df3_long_2020_scaled2, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
results['clusters'].value_counts()
fig = px.scatter(results,x='pca1',y='pca2',color='clusters')
fig.show()
| k | Inertia | No. of Iterations to Converge |
|---|---|---|
| 3 | 19.3 | 6 |
| 4 | 22.5 | 5 |
| 6 | 12.3 | 12 |
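The chosen configuration can then be fitted and projected for plotting along these lines. This is a sketch on synthetic stand-in data (a dense 'normal' group plus a small outlying group); in the report the input is the scaled 2020 frame `df3_long_2020_scaled2`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in: 300 'normal' firms plus 25 outlying firms, 13 features
rng = np.random.default_rng(101)
X = np.vstack([rng.normal(0.1, 0.02, size=(300, 13)),
               rng.normal(0.9, 0.05, size=(25, 13))])

# k-means with k=3, matching the kwargs used for the elbow/silhouette search
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=101)
labels = km.fit_predict(X)

# 2-D PCA projection (the pca1/pca2 axes used for the cluster scatter plot)
coords = PCA(n_components=2).fit_transform(X)
```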
I then proceeded to test density-based and hierarchical clustering methods. For density-based clustering, I used DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN tends to be robust to the presence of noise and outliers in the data, while k-means clustering can be sensitive to outliers. The likely reason why k-means performs well on our data is that our aim is to give differential treatment to outliers (essentially, to identify them), rather than treat them like the ‘normal’ data points, as may be desired in most other use cases where k-means is applied to a conventional dataset containing outliers. In other words, since the cluster centroids in k-means are means, there will be very few observations in the clusters where the 'outlier' firms sit, thereby facilitating their isolation. For DBSCAN, the choice of minPts is subjective but is commonly advised to be 2 times the number of dimensions, and hence I chose a value of 26 (our data has 13 features). I leveraged the distances from the k-nearest neighbours algorithm to choose an optimal value of eps as 0.3. Density-based clustering produced two clusters, one containing the outlying firms (27 of the 325 firms input into the model). I validated this against the anomalous firms highlighted by k-means and found an overlap of 20 firms between the two algorithms.
from sklearn.neighbors import NearestNeighbors

# Sorted distances to the 26 nearest neighbours (26 matching minPts), used to
# read off an elbow for eps
neigh_dist, _ = NearestNeighbors(n_neighbors=26).fit(df3_long_2020_scaled2).kneighbors(df3_long_2020_scaled2)
sort_neigh_dist = np.sort(neigh_dist, axis=0)
k_dist = sort_neigh_dist[:, 24]
plt.plot(k_dist)
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations")
plt.show()
DBSCAN_dataset.Cluster.value_counts().to_frame()
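The DBSCAN step itself can be sketched as below, with minPts = 2 × 13 = 26 and eps = 0.3 as in the text, on the same synthetic stand-in data. Note that on this toy data the small outlying group falls below minPts and so surfaces as DBSCAN 'noise' (label -1) rather than a second cluster; either way, those are the firms to flag.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: 300 'normal' firms plus 25 outlying firms, 13 features
rng = np.random.default_rng(101)
X = np.vstack([rng.normal(0.1, 0.02, size=(300, 13)),
               rng.normal(0.9, 0.05, size=(25, 13))])

# eps from the k-NN distance elbow; min_samples = 2 x number of features
labels = DBSCAN(eps=0.3, min_samples=26).fit_predict(X)
n_noise = int((labels == -1).sum())  # -1 marks points DBSCAN could not cluster
```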
For agglomerative hierarchical clustering, I created the dendrogram to choose the number of clusters, which I set at 3. The hierarchical clustering algorithm flags 32 firms in the two ‘anomalous’ clusters, to which I’d recommend supervisors pay attention. Additionally, the firms highlighted through k-means clustering are a perfect subset of those highlighted through hierarchical clustering. Hence, out of the 325 firms in our dataset, 24 firms are flagged by k-means, 27 by DBSCAN and 32 by agglomerative clustering. Of these, 20 firms are flagged by all three algorithms. Thus, we see that all three algorithms flag a similar set of firms as being in isolated clusters. As described above, for a conventional use case, DBSCAN is expected to work best with data containing noise and outliers. However, our use case is unique in that it requires outliers to be treated differently from the ‘normal’ data points, and hence we get a good performance with k-means as well. Additionally, this dataset is partly anonymised using a random multiplier, which may distort certain trends. It is important to note that we might need to tweak parameters like k and eps, depending on the number of firms that can be supervised with the scarce resources available.
plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(df3_long_2020_scaled_hc, method='ward'))
plt.axhline(y=2.6, color='r', linestyle='--')  # cut corresponding to 3 clusters
df3_long_2020_scaled_hc['Clusters'].value_counts()
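The agglomerative step corresponding to that dendrogram cut can be sketched as Ward linkage with 3 clusters, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: 300 'normal' firms plus 25 outlying firms, 13 features
rng = np.random.default_rng(101)
X = np.vstack([rng.normal(0.1, 0.02, size=(300, 13)),
               rng.normal(0.9, 0.05, size=(25, 13))])

# Ward linkage with the 3 clusters suggested by the dendrogram
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
```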
In my next approach, I leveraged the isolation forest algorithm for anomaly detection. In this model, samples that travel deeper into the tree are less likely to be outliers, as they need more cuts to isolate them; samples that end up in shorter branches indicate likely anomalies, as it was easier for the tree to separate them from other observations. I have used the same dataset for the year 2020 (with 325 firms and 13 features) to fit this model. I have assumed a contamination of 10% in the data; this parameter is important to the model and can be tweaked according to the number of firms there is capacity to supervise. The model flags 33 firms as anomalies. Of these 33 firms, there is an overlap of 22 (out of 24) with those flagged by the k-means clustering algorithm, 25 (out of 27) with DBSCAN and 25 (out of 32) with agglomerative clustering.
df3_long_2020_scaled_if['anomaly'].value_counts()
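The isolation forest step with the assumed 10% contamination can be sketched as below, on the same synthetic stand-in data; `fit_predict` returns -1 for flagged anomalies and 1 otherwise.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: 300 'normal' firms plus 25 outlying firms, 13 features
rng = np.random.default_rng(101)
X = np.vstack([rng.normal(0.1, 0.02, size=(300, 13)),
               rng.normal(0.9, 0.05, size=(25, 13))])

# contamination=0.10 sets the score threshold so ~10% of points are flagged
flags = IsolationForest(contamination=0.10, random_state=101).fit_predict(X)
n_anomalies = int((flags == -1).sum())
```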
In conclusion, the clustering and isolation forest machine learning techniques perform well in identifying a common pool of 19 firms to bring to the supervisors’ attention. The overlap in results from various algorithms indicates that they are robust and can be used independently as well, depending on the availability of time and resources. If a union of the firms highlighted by the three clustering methods and the isolation forest algorithm is considered, the set consists of 41 firms to investigate, which is a significant reduction in resource consumption compared to that required for the original set of 325 firms. The table below provides a summary of the results from each technique. Such a view can be generated for supervisors after a consensus on the methods to be implemented is reached (as there is a large overlap of results indicating robustness of each technique independently).
| Technique | Number of Firms Flagged for Investigation | Firms Flagged for Investigation |
|---|---|---|
| Graphical Analysis: Total assets (£m) vs. Total liabilities (£m) | 16 | Firm 10, Firm 101, Firm 105, Firm 151, Firm 199, Firm 210, Firm 216, Firm 247, Firm 26, Firm 298, Firm 311, Firm 34, Firm 4, Firm 6, Firm 7, Firm 73 |
| Graphical Analysis: NWP(£m) vs. GWP(£m) | 25 | Firm 10, Firm 101, Firm 104, Firm 105, Firm 120, Firm 131, Firm 151, Firm 17, Firm 199, Firm 210, Firm 23, Firm 234, Firm 247, Firm 25, Firm 26, Firm 276, Firm 30, Firm 311, Firm 34, Firm 38, Firm 4, Firm 51, Firm 6, Firm 7, Firm 73 |
| z-score: Absolute change in NWP (£m) | 13 | Firm 1, Firm 104, Firm 105, Firm 131, Firm 151, Firm 199, Firm 210, Firm 247, Firm 26, Firm 276, Firm 311, Firm 4, Firm 7 |
| z-score: Absolute change in Gross Claims Incurred (£m) | 18 | Firm 105, Firm 112, Firm 158, Firm 17, Firm 200, Firm 216, Firm 22, Firm 234, Firm 25, Firm 261, Firm 275, Firm 283, Firm 286, Firm 37, Firm 49, Firm 52, Firm 74, Firm 96 |
| Clustering: Partitional (K-means) | 24 | Firm 10, Firm 105, Firm 112, Firm 151, Firm 158, Firm 17, Firm 199, Firm 210, Firm 22, Firm 247, Firm 26, Firm 261, Firm 280, Firm 283, Firm 286, Firm 298, Firm 304, Firm 311, Firm 34, Firm 4, Firm 52, Firm 6, Firm 7, Firm 73 |
| Clustering: Density-Based (DBSCAN) | 27 | Firm 10, Firm 105, Firm 112, Firm 127, Firm 151, Firm 158, Firm 166, Firm 17, Firm 199, Firm 210, Firm 22, Firm 228, Firm 26, Firm 270, Firm 283, Firm 284, Firm 286, Firm 304, Firm 311, Firm 34, Firm 4, Firm 52, Firm 6, Firm 7, Firm 72, Firm 73, Firm 99 |
| Clustering: Hierarchical (Agglomerative) | 32 | Firm 10, Firm 105, Firm 112, Firm 151, Firm 158, Firm 17, Firm 188, Firm 199, Firm 208, Firm 210, Firm 22, Firm 247, Firm 26, Firm 261, Firm 276, Firm 280, Firm 283, Firm 286, Firm 295, Firm 298, Firm 30, Firm 304, Firm 311, Firm 34, Firm 4, Firm 51, Firm 52, Firm 6, Firm 7, Firm 73, Firm 74, Firm 81 |
| Isolation Forest | 33 | Firm 10, Firm 105, Firm 112, Firm 127, Firm 151, Firm 158, Firm 166, Firm 17, Firm 188, Firm 199, Firm 210, Firm 22, Firm 228, Firm 234, Firm 247, Firm 25, Firm 26, Firm 280, Firm 283, Firm 284, Firm 295, Firm 298, Firm 30, Firm 304, Firm 311, Firm 34, Firm 4, Firm 52, Firm 6, Firm 7, Firm 72, Firm 73, Firm 99 |
| Common firms flagged by all clustering techniques and isolation forest | 19 | Firm 10, Firm 6, Firm 105, Firm 151, Firm 199, Firm 4, Firm 158, Firm 112, Firm 73, Firm 34, Firm 22, Firm 52, Firm 311, Firm 26, Firm 210, Firm 17, Firm 304, Firm 7, Firm 283 |
| Firms flagged by at least one clustering technique or isolation forest | 41 | Firm 228, Firm 30, Firm 280, Firm 22, Firm 166, Firm 295, Firm 6, Firm 112, Firm 158, Firm 286, Firm 26, Firm 188, Firm 7, Firm 81, Firm 105, Firm 234, Firm 270, Firm 73, Firm 261, Firm 52, Firm 208, Firm 74, Firm 298, Firm 51, Firm 304, Firm 72, Firm 25, Firm 10, Firm 247, Firm 151, Firm 199, Firm 4, Firm 276, Firm 127, Firm 34, Firm 284, Firm 311, Firm 210, Firm 17, Firm 99, Firm 283 |
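The consensus and union rows of the table can be computed with simple set operations. The abbreviated sets below are illustrative stand-ins for the full per-model firm lists above:

```python
# Per-model flagged firms (abbreviated stand-in sets, not the full lists)
kmeans_flags = {'Firm 10', 'Firm 105', 'Firm 22', 'Firm 280'}
dbscan_flags = {'Firm 10', 'Firm 105', 'Firm 22', 'Firm 127'}
hier_flags   = {'Firm 10', 'Firm 105', 'Firm 22', 'Firm 81'}
iso_flags    = {'Firm 10', 'Firm 105', 'Firm 22', 'Firm 99'}

common = kmeans_flags & dbscan_flags & hier_flags & iso_flags  # flagged by all
union  = kmeans_flags | dbscan_flags | hier_flags | iso_flags  # flagged by any
```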
Kindly Note: The machine learning algorithms are implemented on a subset of the data for the year 2020, while the graphical and z-score analyses cover the entire dataset from 2016 to 2020.
In order to create this report using cloud technologies, Microsoft Azure offers a suite of tools that can be used to create an end-to-end data processing and analytics pipeline. In our case, we assume that the data is being batch processed daily.
The first step in the pipeline is data sourcing and collection. This data could be submitted by regulated firms through a data portal in a pre-defined format, and collected using a daily batch ingestion process. Validation checks, as outlined by the supporting taxonomy, should be run on each submission. It is important that data management and governance capabilities are invoked to ensure that data quality is maintained, and to support lineage and discovery, master data management and exception management. Additionally, common services such as workflow management, cost management, monitoring, enterprise integration (ETL/ELT), continuous integration, etc. should be abided by so that the data flows are securely and efficiently managed.
The second stage in the pipeline is data storage. Once the data is collected using batch capabilities and has passed through validation checks, it should be securely stored in the data lake storage (MS Azure data lake). The data lake storage is a cloud-based repository for both structured and unstructured data. Here, the data could be stored in its native raw format as it was submitted. Data processing can be invoked on the raw data using Azure Data Factory to ensure that it is transformed and augmented, moving it through the medallion data architecture cycle. Azure Data Factory is a managed cloud service that is built to handle complex extract-transform-and-load (ETL), extract-load-and-transform (ELT) and data integration projects. All data engineering pipelines could be supported by Azure Data Factory, and it could be used as the primary tool for data processing. The metadata should be captured at multiple points to maintain data quality and lineage.
The data processing involves performing several of the tasks in my report, such as converting the dataset from wide to long format, extracting the year from string variables such as ‘2016YE’ and converting it to a date format, applying scaling techniques to the data, creating derived variables, calculating z-scores, etc. A batch job can be scheduled in Azure Data Factory to achieve this. The processed and cleaned data can then be stored in the SQL warehouse to meet analytics requirements at scale. Azure Synapse can be used to provide a workspace for data preparation, exploration, modelling and management. Tasks in my report such as grouping the data, filtering for given time periods, checking for null values, describing the data, obtaining basic statistics and replacing infinity values with NaNs can be done using Synapse. This cleansed data can be accessed through several tools to support data visualisation and artificial intelligence/machine learning techniques.
Data visualisations can be made using Tableau or MS Power BI. The charts in my report can be created as interactive dashboards in these tools. This could give supervisors an idea of the presence of outlying firms in the data that they may want to focus on. They can also tweak thresholds on the graphical analyses to increase or decrease the number of firms supervised, based on the availability of supervisory resources. The machine learning models in my report, i.e. clustering and isolation forest, can be developed in Azure Machine Learning Studio. We can deploy machine learning batch inference jobs in ML Studio. It also has a central monitoring solution for model logs and metrics (the inference data and data drift metrics are stored here). The results from the clustering and isolation forest models can be stored here, with the resulting 'target' firms’ data saved as inference data. Databricks could also be used as an alternative to ML Studio.
The next stage in the pipeline is the publication of data/reports to stakeholders. The underlying data can be shared with stakeholders through Azure Data Share. The model can be run to generate outputs using Azure Machine Learning Studio. These outputs can also be disseminated to stakeholders via a data publication report shared through Microsoft SharePoint Online.
The data management and governance requirements on data quality, lineage, data usage and handling and data modelling should underpin every stage of this process. This is a broad pipeline based on my understanding of the dataset, the capabilities provided by MS Azure and the results required to be generated.
From the above analyses, we have seen certain variables with extremely small or large values, indicating that they may not be outliers but errors in reporting instead. Errors in reporting could include denomination issues (e.g. misinterpretation of the expected denomination of a reporting entry, such as reporting in thousands rather than in millions), reporting in the incorrect box, or even a misinterpretation of the meaning of a reporting template box. Such errors make the data implausible, and we need to raise queries with the reporting firms.
I have done a graphical analysis followed by raw data validation for this. As witnessed through the plots of the size of buffer and SCR coverage ratio in the section above, the box plot of the SCR coverage ratio shows an extreme value of 0.9993bn. For a ratio variable that has a median of 1.5875, a Q1 of 0.2913 and a Q3 of 3.0612, a value close to 0.9bn is definitely erroneous. I am not considering the mean, as it is highly affected by this erroneous value. The raw data shows such values for Firms 131 and 216 in the year 2017, which were flagged in our graphical analyses and z-score checks above. The ML techniques were applied to the 2020 data, so this data point is not present there; otherwise, it would likely have been flagged by the clustering and isolation forest algorithms as well. A firm-wise interactive line graph of the SCR coverage ratio for each year shows that Firm 320 in the year 2016, Firms 1, 131 and 216 in the year 2017 and Firm 66 in the year 2018 could be cases of erroneous reporting and not just outliers.
df3_long_err['SCRcoverageratio'].iplot(kind='box')
fig = px.line(df3_long_err,x='Firm',y='SCRcoverageratio',color='Year')
fig.show()
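The box-plot finding above can be hardened into a simple rule-of-thumb check that separates ordinary outliers from implausible, likely-misreported values. A minimal sketch on toy data; the 100×IQR cut-off and the function name `plausibility_flags` are my own illustrative choices, not calibrated rules:

```python
import pandas as pd

def plausibility_flags(s: pd.Series, extreme_mult: float = 100.0) -> pd.Series:
    """Label values as 'ok', 'outlier' (outside the 1.5*IQR fences) or
    'likely_error' (further than extreme_mult IQRs from the median)."""
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flags = pd.Series("ok", index=s.index)
    flags[(s < low) | (s > high)] = "outlier"
    flags[(s - med).abs() > extreme_mult * iqr] = "likely_error"
    return flags

# Toy data shaped like the SCR coverage ratio: values near the observed
# quartiles, plus one entry many orders of magnitude too large.
ratios = pd.Series([0.29, 1.59, 3.06, 2.1, 0.8, 12.0, 999_300_000.0],
                   index=[f"Firm {i}" for i in range(1, 8)])
print(plausibility_flags(ratios))
```

Values caught by the extreme rule would be queried with the reporting firm rather than treated as genuine outliers.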
A similar graphical analysis of the Gross combined ratio shows a value of 3.98mn for Firm 188 in the year 2019 and a value of 16,023.6 for Firm 28 in the year 2018. For a variable with a median of 0.1395, a Q1 of 0 and a Q3 of 1, these definitely seem like erroneous values, and not just outliers in the data. Based on this analysis, I would recommend that supervisors pay careful attention to these firms and possibly request them to correct and resubmit their data.
df3_long_err['Grosscombinedratio'].iplot(kind='box')
fig = px.line(df3_long_err,x='Firm',y='Grosscombinedratio',color='Year')
fig.show()
The Underwriting dataset (data frame 2) has 131 firms that were not present in the general dataset. Hence, the metrics available for these firms are limited. However, I have tried to highlight the firms to be investigated here using a few methods, with the dataset being limited to the year 2020.
A graphical analysis of these variables shows that Firm 449 has an outlying value of -12,033.4 for its Gross combined ratio. Similarly, Firm 418 appears to have anomalous values for the Pure net claims ratio and the Net combined ratio. The supervisors should therefore pay keen attention to these firms. A bivariate scatter plot of the Gross Best Estimate Liability vs. the Net Best Estimate Liability for the year 2020 shows that firms whose GrossBEL(inc.TPsaswhole,pre-TMTP)(£m) or NetBEL(inc.TPsasawhole,pre-TMTP)(£m) is greater than 1,000 should be focused on. The 15 firms flagged here are Firm 329, Firm 337, Firm 347, Firm 349, Firm 351, Firm 359, Firm 362, Firm 387, Firm 393, Firm 397, Firm 403, Firm 427, Firm 428, Firm 429 and Firm 451.
fig = px.line(df4_long_2020,x='Firm',y=['GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)',
'Grossclaimsincurred(£m)', 'Grosscombinedratio', 'Grossexpenseratio',
'NetBEL(inc.TPsasawhole,pre-TMTP)(£m)', 'Netcombinedratio',
'Netexpenseratio', 'Puregrossclaimsratio', 'Purenetclaimsratio'])
fig.show()
df4_long_2020.iplot(kind='scatter',x='GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)',y='NetBEL(inc.TPsasawhole,pre-TMTP)(£m)',
mode='markers',color='red'
,title='Gross Best Estimate Liability vs. Net Best Estimate Liability for the Year 2020',xTitle='Gross BEL',yTitle='Net BEL')
In order to apply machine learning techniques, i.e. k-means clustering and isolation forest, to this dataset, similar considerations for correlated variables and the choice of scaler are used. The final variables in the dataset are GrossBEL(inc.TPsaswhole,pre-TMTP)(£m), Grossclaimsincurred(£m), Grosscombinedratio, Grossexpenseratio, NetBEL(inc.TPsasawhole,pre-TMTP)(£m), Netcombinedratio and Purenetclaimsratio. k-means clustering gives us good results with an inertia of 3.78 for k = 4. Out of the 131 firms in the input data, 11 anomalous firms are clustered into 3 clusters: Firms 337, 349, 359, 362, 387, 418, 427, 428, 429, 449 and 451. We see a major overlap with the firms highlighted through the graphical analyses. An isolation forest model fit to this data flags 13 firms (out of 131) as anomalies: Firms 337, 347, 359, 362, 364, 387, 403, 418, 427, 428, 449, 451 and 455. Again, we see a major overlap of 9 firms (out of 11) with the k-means clustering algorithm.
We should note here that the firms obtained as outliers through the unsupervised machine learning algorithms are in line with those flagged in the graphical analyses. While this is just a small subsample, the same analysis can be scaled to a larger number of firms and the number of clusters in k-means or the contamination parameter in isolation forest can be adjusted accordingly.
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(df4_long.corr(),annot=True,cmap="YlGnBu")
kmeans_kwargs = {
"init": "random",
"n_init": 10,
"max_iter": 300,
"random_state": 101,
}
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df4_long_2020_c_scaled)
    sse.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df4_long_2020_c_scaled)
    score = silhouette_score(df4_long_2020_c_scaled, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
results['clusters'].value_counts()
fig = px.scatter(results,x='pca1',y='pca2',color='clusters')
fig.show()
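The isolation forest fit behind the `anomaly` column queried below is not reproduced above. A minimal sketch of how it could be produced, run here on a synthetic stand-in for the scaled frame, with contamination = 0.1 chosen to match roughly 13 flags out of 131 firms:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(101)
# Synthetic stand-in for df4_long_2020_c_scaled: 131 firms, 7 features
# already scaled to [0, 1], with 13 planted extreme rows.
X = pd.DataFrame(rng.uniform(0.0, 0.1, size=(131, 7)),
                 index=[f"Firm {i}" for i in range(325, 456)])
X.iloc[:13] = 0.95  # planted anomalies near the top of the scale

# contamination = 0.1 so roughly a tenth of the firms are flagged
iso = IsolationForest(contamination=0.1, random_state=101)
X["anomaly"] = iso.fit_predict(X)  # -1 = anomaly, 1 = inlier
print(X["anomaly"].value_counts())
anomalous_firms = X.index[X["anomaly"] == -1]
```

On the real data, `anomalous_firms` would be the list of firms handed to supervisors; the contamination parameter controls how many are flagged.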
df4_long_2020_c_scaled['anomaly'].value_counts()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
import plotly.express as px
import datetime
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from kneed import KneeLocator
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
import scipy.cluster.hierarchy as shc
from sklearn.ensemble import IsolationForest
from scipy import stats
%matplotlib inline
cf.go_offline()
The two datasets (one general and the other for underwriting) are provided in two different sheets of the same Excel file. The data is in the wide format, such that for each firm, each variable is repeated 5 times (once for each year from 2016 to 2020). This is then converted to the long format, such that each firm now has 5 rows (one for each year from 2016 to 2020).
# Using MultiIndexing as we have sub columns
df1_wide = pd.read_excel('DataScientist_009749_Dataset.xlsx', sheet_name = 'Dataset 1 - General', header = [0, 1])
df1_wide.head()
# Using MultiIndexing as we have sub columns
df2_wide = pd.read_excel('DataScientist_009749_Dataset.xlsx', sheet_name = 'Dataset 2 - Underwriting', header = [0, 1])
df2_wide.head()
# Convert the data from wide to long format (df1)
df1_long = df1_wide.stack(level=1).reset_index().rename(columns={"level_0": "Firm", "level_1":"Year"})
df1_long['Year'] = df1_long['Year'].apply(lambda x:x[:4])
df1_long = df1_long.sort_values(['Firm','Year'])
df1_long.columns = df1_long.columns.str.replace(' ', '')
df1_long['Year'] = pd.to_datetime(df1_long['Year']).dt.year
df1_long.head()
# Applying the same data formatting to df2
# Convert the data from wide to long format (df2)
df2_long = df2_wide.stack(level=1).reset_index().rename(columns={"level_0": "Firm", "level_1":"Year"})
df2_long['Year'] = df2_long['Year'].apply(lambda x:x[:4])
df2_long = df2_long.sort_values(['Firm','Year'])
df2_long.columns = df2_long.columns.str.replace(' ', '')
df2_long['Year'] = pd.to_datetime(df2_long['Year']).dt.year
df2_long.head()
df1_long.info()
df1_long['Firm'].nunique()
df1_long.describe()
df1_long.dtypes
df2_long.info()
df2_long['Firm'].nunique()
df2_long.describe()
df2_long.dtypes
Checking for missing data - we find out that there is no missing data in either of the data frames.
df1_long.isnull().sum()
df2_long.isnull().sum()
# Assuming that the Firm Numbers indicate the same firms in both datasets - doing an inner join
df3_long = pd.merge(df1_long,df2_long,on=['Firm','Year'],how='inner')
df3_long.shape
df3_long['Firm'].nunique()
df3_long.info()
df3_long.describe()
df3_long.dtypes
df3_long.isnull().sum()
df4_long = df2_long[~df2_long.Firm.isin(df1_long['Firm'])]
df4_long.shape
df4_long['Firm'].nunique()
df4_long.info()
df4_long.describe()
df4_long.dtypes
df4_long.isnull().sum()
The data frames are explored using methods like info(), describe(), dtypes. All variables in the data are numeric except the firm number (e.g. ‘Firm 1’) which is later set as the index before running any algorithms on it.
The reason for creating these derived variables is to see how firms are changing year-on-year. As we will see ahead, the underlying metrics are correlated with several other features, and hence their trends would also be captured through these derived variables.
# Creating a copy of df3_long
df3_long_copy = df3_long.copy()
df3_long['Size_of_Buffer'] = df3_long['SCRcoverageratio'] - 1
df3_long['Profitable_Binary'] = df3_long['Netcombinedratio'].apply(lambda x:1 if x<1 else 0)
df3_long['NWP_abs_change_YoY'] = df3_long.groupby('Firm')['NWP(£m)'].diff().fillna(0, downcast='int')
df3_long['NWP_perc_change_YoY'] = (df3_long.groupby('Firm')['NWP(£m)'].pct_change().fillna(0, downcast='int'))*100
df3_long['SCRCovRatio_abs_change_YoY'] = df3_long.groupby('Firm')['SCRcoverageratio'].diff().fillna(0, downcast='int')
df3_long['SCRCovRatio_perc_change_YoY'] = (df3_long.groupby('Firm')['SCRcoverageratio'].pct_change().fillna(0, downcast='int'))*100
df3_long['Grossclaimsincur_abs_change_YoY'] = df3_long.groupby('Firm')['Grossclaimsincurred(£m)'].diff().fillna(0, downcast='int')
df3_long['Grossclaimsincur_perc_change_YoY'] = (df3_long.groupby('Firm')['Grossclaimsincurred(£m)'].pct_change().fillna(0, downcast='int'))*100
df3_long['Netcombratio_abs_change_YoY'] = df3_long.groupby('Firm')['Netcombinedratio'].diff().fillna(0, downcast='int')
df3_long['Netcombratio_perc_change_YoY'] = (df3_long.groupby('Firm')['Netcombinedratio'].pct_change().fillna(0, downcast='int'))*100
# Replacing the first value within each firm with NaN as it doesn't have a preceding value
df3_long.loc[df3_long.groupby('Firm')['NWP_abs_change_YoY'].head(1).index, 'NWP_abs_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['NWP_perc_change_YoY'].head(1).index, 'NWP_perc_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['SCRCovRatio_abs_change_YoY'].head(1).index, 'SCRCovRatio_abs_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['SCRCovRatio_perc_change_YoY'].head(1).index, 'SCRCovRatio_perc_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['Grossclaimsincur_abs_change_YoY'].head(1).index, 'Grossclaimsincur_abs_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['Grossclaimsincur_perc_change_YoY'].head(1).index, 'Grossclaimsincur_perc_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['Netcombratio_abs_change_YoY'].head(1).index, 'Netcombratio_abs_change_YoY'] = np.NaN
df3_long.loc[df3_long.groupby('Firm')['Netcombratio_perc_change_YoY'].head(1).index, 'Netcombratio_perc_change_YoY'] = np.NaN
# Replacing the inf generated due to division by 0, by NaN values
df3_long.replace([np.inf, -np.inf], np.nan, inplace=True)
df3_long[['Firm','Year','NWP(£m)','Grossclaimsincurred(£m)','NWP_abs_change_YoY', 'Grossclaimsincur_perc_change_YoY']].head(20)
I have conducted an Exploratory Data Analysis (EDA) through various types of data visualisations. My aim through this section was to get a better understanding of the data and create visualisations that focus on the priorities of supervisory resource allocation. I wanted to come up with certain ‘gating rules’ to weed out or identify the important or obviously ‘risky’ firms from the population. These kinds of rules can be used by supervisors while examining dashboards/reports to identify the obvious ‘target’ firms, before delving deeper into the results provided by more sophisticated machine learning algorithms on the data.
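The 'gating rules' idea can be expressed as a small helper that applies named threshold rules and tabulates which firms trigger them. A sketch with toy data; the helper `apply_gating_rules` and these particular cut-offs are my own illustrations, while the actual thresholds are derived per metric in the sections that follow:

```python
import pandas as pd

def apply_gating_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Apply named boolean rules to a firm-level frame and keep only
    the firms that trigger at least one rule."""
    hits = pd.DataFrame({name: rule(df) for name, rule in rules.items()},
                        index=df.index)
    hits["n_rules_triggered"] = hits.sum(axis=1)
    return hits[hits["n_rules_triggered"] > 0]

# Illustrative thresholds only; the real cut-offs are set per metric.
rules = {
    "large_assets": lambda d: d["Totalassets(£m)"] > 100_000,
    "large_gwp": lambda d: d["GWP(£m)"] > 10_000,
}
demo = pd.DataFrame(
    {"Totalassets(£m)": [50.0, 250_000.0, 80_000.0],
     "GWP(£m)": [12_000.0, 500.0, 9_000.0]},
    index=["Firm 1", "Firm 2", "Firm 3"])
print(apply_gating_rules(demo, rules))
```

Each rule stays independently named, so a supervisor can see not just that a firm was flagged, but which gate it failed.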
Since we have numeric (continuous) data, we will start with pairplots using the seaborn library. This helps us visualise the pairwise relationships between variables in our dataset.
#import warnings
#warnings.filterwarnings("ignore")
sns.pairplot(df1_long,hue='Year')
sns.pairplot(df2_long,hue='Year')
sns.pairplot(df3_long.drop(['NWP_abs_change_YoY',
'NWP_perc_change_YoY', 'SCRCovRatio_abs_change_YoY',
'SCRCovRatio_perc_change_YoY', 'Grossclaimsincur_abs_change_YoY',
'Grossclaimsincur_perc_change_YoY', 'Netcombratio_abs_change_YoY',
'Netcombratio_perc_change_YoY'],axis=1),hue='Year')
I have created a pairplot for df1 and df2 individually, and then for the joint data frame df3, as there could be relationships between variables across the two datasets. We see that several variables are correlated with each other. Additionally, it is quite clearly visible that there are several outliers in the data, indicated by anomalous data points on the majority of the plots. We can delve deeper by making bivariate scatter plots to examine the relationships between these variables. I have also explored a heatmap with annotations to quantify the relationships between the variables.
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(df3_long.corr(),annot=True,cmap="YlGnBu")
From this heatmap, it is clearly visible that the following variables are directly correlated:
- Totalassets(£m) with Totalliabilities(£m)
- EoFforSCR(£m) with Totalassets(£m), Totalliabilities(£m), SCR(£m), NWP(£m), GWP(£m), Excessofassetsoverliabilities(£m)[=equity]
- GWP(£m) with NWP(£m)
- Grossclaimsincurred(£m) with GrossBEL
- Netexpenseratio, Grosscombinedratio, Grossexpenseratio, Puregrossclaimsratio, Purenetclaimsratio with each other (perfect correlation)
to name a few.
Indicators of a firm's size could be their Gross Written Premium, Net Written Premium, Total Assets, Total Liabilities, to name a few. As we see from the heatmap, these variables are positively correlated as well.
# Using plotly and cufflinks
df3_long.iplot(kind='scatter',x='Totalassets(£m)',y='Totalliabilities(£m)',mode='markers',title='Total assets (£m) vs. Total liabilities (£m)',
xTitle='Totalassets(£m)',yTitle='Totalliabilities(£m)')
This plot is created using plotly and cufflinks so as to make it interactive. Note that each firm occurs multiple times on this plot (due to multiple rows, one for each year). From this, we can infer that firms with total assets > 100K or total liabilities > 100K should be given more attention. The following are the unique firms that satisfy these criteria:
firms_size_assets_liab = df3_long[(df3_long['Totalassets(£m)']>100000) | (df3_long['Totalliabilities(£m)']>100000)]['Firm'].unique()
print(firms_size_assets_liab)
len(firms_size_assets_liab)
# Adding a regression line to the above plot for data exploration
sns.jointplot(x='Totalassets(£m)',y='Totalliabilities(£m)',data=df3_long, kind='reg',)
# Using NWP(£m) and GWP(£m)
df3_long.iplot(kind='scatter',x='GWP(£m)',y='NWP(£m)',mode='markers',color='purple'
,title='GWP(£m) vs. NWP(£m)',xTitle='Gross Written Premium (£m)',yTitle='Net Written Premium(£m)')
On zooming into this interactive plot, we see that there are big firms that need to be focused on, where GWP(£m) > 10K or NWP(£m) > 5K.
firms_size_gwp_nwp = df3_long[(df3_long['GWP(£m)']>10000) | (df3_long['NWP(£m)']>5000)]['Firm'].unique()
firms_size_gwp_nwp
len(firms_size_gwp_nwp)
print(list(np.intersect1d(firms_size_assets_liab, firms_size_gwp_nwp)) )
len(list(np.intersect1d(firms_size_assets_liab, firms_size_gwp_nwp)) )
df3_long[df3_long['Year']==2020].sort_values('GWP(£m)',ascending=False)[['Firm','Year','GWP(£m)']].head(10)
df3_long[df3_long['Year']==2019].sort_values('GWP(£m)',ascending=False)[['Firm','Year','GWP(£m)']].head(10)
df3_long.iplot(kind='scatter',x='Firm',y='GWP(£m)',mode='markers',title='Firm-wise Gross Written Premium(£m)',
xTitle='Firm',yTitle='GWP(£m)',color='blue')
From this, we can infer that firms with turnover, i.e. GWP(£m) > 10K, should be given more attention. The following are the unique firms that satisfy this criterion:
firms_size_gwp = df3_long[(df3_long['GWP(£m)']>10000)]['Firm'].unique()
firms_size_gwp
df3_long.iplot(kind='scatter',x='Firm',y='Size_of_Buffer',mode='markers',title='Firm-wise Size of Buffer',
xTitle='Firm',yTitle='Size of Buffer',color='pink')
The size of buffer is derived from the SCR coverage ratio. There are clear anomalies here, which may actually be cases of erroneous reporting; I have explored this later.
The threshold below may provide a mix of anomalies and erroneous reporting. On increasing it, we can obtain the firms that have incorrectly reported the SCR coverage ratio (explored ahead).
firms_size_sizeofbuffer = df3_long[(df3_long['Size_of_Buffer']>10000)]['Firm'].unique()
firms_size_sizeofbuffer
print(list(np.intersect1d(firms_size_assets_liab, firms_size_gwp_nwp)) )
Based on the availability of resources, the above thresholds can be tweaked to see firms sitting in buckets just below the ones defined above.
df3_long[['NWP_abs_change_YoY', 'NWP_perc_change_YoY',
'SCRCovRatio_abs_change_YoY', 'SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY', 'Grossclaimsincur_perc_change_YoY',
'Netcombratio_abs_change_YoY', 'Netcombratio_perc_change_YoY']].iplot(kind='box', title='Boxplot of Change Metrics Y-o-Y',
xTitle='Metric',yTitle='Values')
Another indication that the SCR coverage ratio contains erroneous values.
df3_long['NWP_abs_change_YoY_z'] = stats.zscore(df3_long['NWP_abs_change_YoY'],nan_policy='omit')
z_thresh = 2.5
np.sum(np.abs(df3_long['NWP_abs_change_YoY_z']) > z_thresh)
df3_long[(np.abs(df3_long['NWP_abs_change_YoY_z']) > z_thresh)]['Firm'].nunique()
df3_long[np.abs(df3_long['NWP_abs_change_YoY_z']) > z_thresh][['Firm','Year','NWP(£m)','NWP_abs_change_YoY']]['Firm'].unique()
df3_long[df3_long['Firm']=='Firm 104'][['Firm','Year','NWP(£m)','NWP_abs_change_YoY']]
df3_long['Grossclaimsincur_abs_change_YoY_z'] = stats.zscore(df3_long['Grossclaimsincur_abs_change_YoY'],nan_policy='omit')
z_thresh = 2.5
np.sum(np.abs(df3_long['Grossclaimsincur_abs_change_YoY_z']) > z_thresh)
df3_long[np.abs(df3_long['Grossclaimsincur_abs_change_YoY_z']) > z_thresh]['Firm'].unique()
df3_long[np.abs(df3_long['Grossclaimsincur_abs_change_YoY_z']) > z_thresh][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_abs_change_YoY']]
df3_long[df3_long['Firm']=='Firm 216'][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_abs_change_YoY']]
df3_long['Grossclaimsincur_perc_change_YoY_z'] = stats.zscore(df3_long['Grossclaimsincur_perc_change_YoY'],nan_policy='omit')
z_thresh = 2.5
np.sum(np.abs(df3_long['Grossclaimsincur_perc_change_YoY_z']) > z_thresh)
The following firms have unusual movements based on percentage changes in Grossclaimsincurred(£m):
df3_long[np.abs(df3_long['Grossclaimsincur_perc_change_YoY_z']) > z_thresh][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_perc_change_YoY_z']]
df3_long[df3_long['Firm']=='Firm 29'][['Firm','Year','Grossclaimsincurred(£m)','Grossclaimsincur_perc_change_YoY']]
Even though there is a large percentage change in the gross claims incurred, the underlying values are still small, and hence these firms may not require as deep an investigation as some others. We need to take the volume into account as well when looking at the percentage, i.e. look at variables in conjunction with each other.
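This volume-plus-percentage principle can be encoded directly: flag a row only when both the relative movement and the absolute level are material. A hedged sketch with illustrative thresholds; the function `material_movers` and its cut-offs are my own, not taken from the analysis above:

```python
import pandas as pd

def material_movers(df, value_col, pct_col, min_level=100.0, min_pct=50.0):
    """Flag rows where a large percentage change coincides with a
    non-trivial absolute level, filtering out small-base swings."""
    mask = (df[pct_col].abs() > min_pct) & (df[value_col].abs() > min_level)
    return df.loc[mask]

demo = pd.DataFrame({
    "Firm": ["Firm 29", "Firm 216"],
    "Grossclaimsincurred(£m)": [0.8, 5_400.0],  # small base vs. large book
    "Grossclaimsincur_perc_change_YoY": [300.0, 120.0],
})
flagged = material_movers(demo, "Grossclaimsincurred(£m)",
                          "Grossclaimsincur_perc_change_YoY")
print(flagged["Firm"].tolist())  # only the large-base mover survives
```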
Identifying firms with major changes Y-o-Y through graphs:
df3_long.iplot(kind='scatter',x='Firm',y='Grossclaimsincur_abs_change_YoY',mode='markers',title='Firm-wise Gross Claims Incurred - Absolute Change Y-o-Y',
xTitle='Firm',yTitle='Gross Claims Incurred Absolute change Y-o-Y',color='blue')
### Using IQR to identify these massive changes in firms
# Firms changing based on IQR
# Checking based on NWP first
out_low_NWP_abs_change_YoY = df3_long['NWP_abs_change_YoY'].quantile(0.25) - 1.5 * (df3_long['NWP_abs_change_YoY'].quantile(0.75) - df3_long['NWP_abs_change_YoY'].quantile(0.25))
out_high_NWP_abs_change_YoY = df3_long['NWP_abs_change_YoY'].quantile(0.75) + 1.5 * (df3_long['NWP_abs_change_YoY'].quantile(0.75) - df3_long['NWP_abs_change_YoY'].quantile(0.25))
np.sum((df3_long.loc[:, 'NWP_abs_change_YoY'] < out_low_NWP_abs_change_YoY) | (df3_long.loc[:, 'NWP_abs_change_YoY'] > out_high_NWP_abs_change_YoY))
df3_long[(df3_long.loc[:, 'NWP_abs_change_YoY'] < out_low_NWP_abs_change_YoY) | (df3_long.loc[:, 'NWP_abs_change_YoY'] > out_high_NWP_abs_change_YoY)]['Firm'].nunique()
(df3_long[(df3_long.loc[:, 'NWP_abs_change_YoY'] < out_low_NWP_abs_change_YoY) | (df3_long.loc[:, 'NWP_abs_change_YoY'] > out_high_NWP_abs_change_YoY)]['Firm'].nunique())/(df3_long['Firm'].nunique())
The IQR may not be the best method to use here as it flags ~50% of the firms to investigate.
fig = px.line(df3_long,x='Firm',y='Grossclaimsincurred(£m)',color='Year')
fig.show()
From the interactive chart for gross claims incurred, we see that gross claims incurred are reducing over time, from 2018 to 2020. Validating the same with the mean and median below.
df3_long.groupby('Year')['Grossclaimsincurred(£m)'].mean()
df3_long.groupby('Year')['Grossclaimsincurred(£m)'].median()
# Using seaborn
sns.jointplot(x='EoFforSCR(£m)',y='SCRcoverageratio',data=df3_long, kind='scatter')
We see clear anomalies at high values of EoFforSCR(£m) and SCRcoverageratio.
Examining Total assets (£m) vs. Total liabilities (£m) for the Year 2016
df3_long[df3_long['Year']==2016].iplot(kind='scatter',x='Totalassets(£m)',y='Totalliabilities(£m)',mode='markers',color='red'
,title='Total assets (£m) vs. Total liabilities (£m) for the Year 2016',xTitle='Totalassets(£m)',yTitle='Totalliabilities(£m)')
From this, we can infer that firms with total assets > 100K or total liabilities > 100K should be given more attention. The following are the unique firms that satisfy these criteria; they are a subset of the ones we saw above in a similar graph for all years:
print(df3_long[(df3_long['Year']==2016) & ((df3_long['Totalassets(£m)']>100000) | (df3_long['Totalliabilities(£m)']>100000))]['Firm'].unique())
Examining GWP(£m) vs. NWP(£m) for the Year 2020
df3_long[df3_long['Year']==2020].iplot(kind='scatter',x='GWP(£m)',y='NWP(£m)',mode='markers',color='green'
,title='GWP(£m) vs. NWP(£m) for the Year 2020',xTitle='Gross Written Premium (£m)',yTitle='Net Written Premium(£m)')
On zooming into this interactive plot, we see clear anomalies, i.e. firms that need to be focused on, at higher values of GWP(£m) and NWP(£m).
Boxplots to detect outliers
df3_long[['GWP(£m)','NWP(£m)','SCR(£m)','EoFforSCR(£m)','SCRcoverageratio','Totalassets(£m)','Totalliabilities(£m)','Excessofassetsoverliabilities(£m)[=equity]','GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)','NetBEL(inc.TPsasawhole,pre-TMTP)(£m)','Grossclaimsincurred(£m)']].iplot(kind='box')
This points to anomalous/erroneous values in the SCRcoverageratio. I will check this in detail in the erroneous reporting section.
df3_long[['Grosscombinedratio','Grossexpenseratio','Netcombinedratio','Netexpenseratio','Puregrossclaimsratio','Purenetclaimsratio']].iplot(kind='box')
Using matplotlib - Plotting using other visualisation libraries
x = df3_long['Firm']
y = df3_long['EoFforSCR(£m)']
plt.figure(figsize=(10,6))
plt.plot(x, y)
plt.title('Firm-wise EoF for SCR (£m)')
plt.xlabel('Firm Number')
plt.ylabel('EoF for SCR (£m)')
The reason for choosing 2020 as the single period for analysis is that, based on prior knowledge and discussions, supervisors tend to focus their attention on firms' most recent data. Since 2020 is the most recent year in our dataset, I have selected it for my analysis. However, the same can be replicated on the previous years' data as well.
Before we proceed with Machine Learning techniques on the data to identify firms deviating from the norm, we need to perform certain preprocessing as the variables have different scales. This can be achieved through scaling.
# Copy of df3_long
df3_long_copy2 = df3_long.copy()
# Break df3_long year-wise
df3_long_2020 = df3_long[df3_long['Year']==2020]
df3_long_2019 = df3_long[df3_long['Year']==2019]
df3_long_2018 = df3_long[df3_long['Year']==2018]
df3_long_2017 = df3_long[df3_long['Year']==2017]
df3_long_2016 = df3_long[df3_long['Year']==2016]
df3_long_2020.shape
# Setting the index of the dataframe as the firm and dropping the year column
df3_long_2020 = df3_long_2020.set_index('Firm')
df3_long_2020 = df3_long_2020.drop('Year',axis=1)
df3_long_2020.head()
df3_long_2020.shape
I have tested the RobustScaler(), StandardScaler() and MinMaxScaler() for this, with the MinMaxScaler() giving the best performance for identifying anomalous clusters through the clustering algorithm. The underlying reason could be that the robust scaler centres on the median and scales by the interquartile range so as to be robust to outliers; the scaled values for outliers and 'normal' data points are therefore not very different, making it harder for the clustering algorithm to separate the outliers into their own cluster. The MinMaxScaler(), however, uses the maximum and minimum values of the variable, so the outliers correspond to the maximum and minimum scaled values while the 'normal' data points are squashed into a narrower range, making it easier for the clustering algorithm to differentiate them.
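The squashing effect described above can be illustrated on a toy column with a single extreme value; this is only a demonstration of the two scalers' behaviour, not the actual dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One feature: nine ordinary values plus a single extreme outlier.
x = np.array([[1.0], [1.2], [0.9], [1.1], [1.0],
              [0.8], [1.3], [1.05], [0.95], [1000.0]])

mm = MinMaxScaler().fit_transform(x)
rb = RobustScaler().fit_transform(x)

# MinMaxScaler pins the outlier at exactly 1 and squashes the nine
# inliers into a sliver near 0, which is what makes the outlier
# cluster easy to separate after scaling.
print("MinMax inliers:", mm[:9].min(), "to", mm[:9].max(), "| outlier:", mm[9, 0])
print("Robust inliers:", rb[:9].min(), "to", rb[:9].max(), "| outlier:", rb[9, 0])
```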
# Using MinMaxScaler()
scaler = preprocessing.MinMaxScaler()
df3_long_2020_scaled = scaler.fit_transform(df3_long_2020)
df3_long_2020_scaled.shape
df3_long_2020_scaled = pd.DataFrame(df3_long_2020_scaled, columns=df3_long_2020.columns, index = df3_long_2020.index)
df3_long_2020_scaled.head(5)
Even though clustering doesn't rely on linear assumptions, so collinearity wouldn't cause major issues, retaining highly correlated features leads to redundancy, larger computational requirements, and the use of more data than required to find the same patterns. Additionally, as supervisory data at the PRA can be very large, we need to be conscious of ensuring optimum utilisation of computational resources.
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(df3_long.corr(),annot=True,cmap="YlGnBu")
From the ratios, I am retaining Netcombinedratio and Puregrossclaimsratio. I am dropping the derived year-on-year change variables and the excess of assets over liabilities.
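The drop list below is specified manually; as an alternative, highly correlated pairs can be found programmatically. A sketch on toy data, where the helper `correlated_pairs` and the 0.9 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |r|) for every pair whose absolute Pearson
    correlation exceeds the threshold, scanning the upper triangle only."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 3))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# Toy frame: y is an exact multiple of x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({"x": x, "y": 2 * x, "z": rng.normal(size=200)})
print(correlated_pairs(demo, threshold=0.9))
```

For each flagged pair, one member can then be dropped, mirroring the manual selection made here.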
df3_long_2020_scaled2 = df3_long_2020_scaled.copy()
df3_long_2020_scaled2 = df3_long_2020_scaled2.drop(['Grosscombinedratio','Grossexpenseratio','Netexpenseratio','Purenetclaimsratio',
'Excessofassetsoverliabilities(£m)[=equity]','NWP_abs_change_YoY',
'NWP_perc_change_YoY', 'SCRCovRatio_abs_change_YoY','SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY',
'Grossclaimsincur_perc_change_YoY', 'Netcombratio_abs_change_YoY',
'Netcombratio_perc_change_YoY', 'NWP_abs_change_YoY_z',
'Grossclaimsincur_abs_change_YoY_z','Grossclaimsincur_perc_change_YoY_z',
'Profitable_Binary'],axis=1)
df3_long_2020_scaled2.columns
df3_long_2020_scaled2.shape
I started with k-means clustering, which is a method of partitional clustering. It is fast and computationally efficient, and provides a good first view of our data.
An important decision in K-means is the choice of 'k' beforehand. I am examining this through the elbow plot and silhouette coefficient.
kmeans_kwargs = {
"init": "random",
"n_init": 10,
"max_iter": 300,
"random_state": 101,
}
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df3_long_2020_scaled2)
    sse.append(kmeans.inertia_)
The graph shows possible elbow points at 3 and 6 clusters, but does not completely flatten. Neither is a clear elbow, making the choice of the number of clusters a subjective decision.
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
kl.elbow
# Using the silhouette coefficient
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df3_long_2020_scaled2)
    score = silhouette_score(df3_long_2020_scaled2, kmeans.labels_)
    silhouette_coefficients.append(score)
The best choice for k is 3, 4 or 6, since these have high silhouette scores.
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
# Fitting the k-means algorithm to our data
kmeans = KMeans(
init="random",
n_clusters=3,
n_init=10,
max_iter=300,
random_state=101)
kmeans.fit(df3_long_2020_scaled2)
kmeans.inertia_ gives us the SSE of the best run; the lower the model inertia, the better the fit. An inertia of 19.29 indicates a good fit.
kmeans.inertia_
# Final locations of the centroid
kmeans.cluster_centers_
# The number of iterations required to converge
kmeans.n_iter_
kmeans.labels_
Our data is multivariate i.e. it has many dimensions. The best way to plot the clusters would be by reducing it to two dimensions using Principal Component Analysis (PCA) and then plotting it.
df3_long_2020_scaled2['clusters'] = kmeans.fit_predict(df3_long_2020_scaled2)
pca_num_components = 2
reduced_data = PCA(n_components=pca_num_components).fit_transform(df3_long_2020_scaled2)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'],index=df3_long_2020_scaled2.index)
results['clusters'] = df3_long_2020_scaled2['clusters']
results['clusters'].value_counts()
fig = px.scatter(results,x='pca1',y='pca2',color='clusters')
fig.show()
Identifying firms in Clusters 0 and 1 to investigate
df3_long_2020_scaled2[(df3_long_2020_scaled2['clusters']==0) | (df3_long_2020_scaled2['clusters']==1)]
| k | Inertia | No. of Iterations to Converge |
|---|---|---|
| 3 | 19.3 | 6 |
| 4 | 22.5 | 5 |
| 6 | 12.3 | 12 |
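The comparison table above can be reproduced with a short loop over candidate values of k. A sketch on synthetic blob data, since the fitted 2020 frame is not reproduced here (the exact inertia and iteration values will differ on the real data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(101)
# Synthetic stand-in for the scaled 2020 frame: three blobs in [0, 1].
X = np.vstack([rng.normal(loc, 0.03, size=(100, 5))
               for loc in (0.2, 0.5, 0.8)]).clip(0, 1)

rows = []
for k in (3, 4, 6):
    km = KMeans(n_clusters=k, init="random", n_init=10,
                max_iter=300, random_state=101).fit(X)
    rows.append({"k": k, "inertia": round(km.inertia_, 2),
                 "iterations": km.n_iter_})
print(pd.DataFrame(rows))
```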
I tried clustering with k as 3, 4 and 6. The best choice of k is 3, as the inertia is not very high and the number of iterations to converge is half of that at 6 clusters. Also, the outlying firms sit in two distinct clusters out of the 3 (as seen above). Proceeding to see whether density-based and hierarchical clustering give different results.
clus_kmeans_firms = (df3_long_2020_scaled2[(df3_long_2020_scaled2['clusters']==0) | (df3_long_2020_scaled2['clusters']==1)]).index
clus_kmeans_firms
The k-means algorithm flagged 24 firms (out of the 325 input firms) sitting in the two ‘anomalous’ clusters.
len(clus_kmeans_firms)
DBSCAN tends to be robust to the presence of noise and outliers in the data, while k-means clustering can be sensitive to outliers. The likely reason k-means performs well on our data is that our aim is to give differential treatment to outliers (essentially, to identify them), rather than to treat them as 'normal' data points, as would be desired in most other use cases where k-means is applied to a conventional dataset containing outliers. In other words, since the cluster centroids in k-means are means, there will be very few observations in the clusters where the 'outlier' firms sit, thereby facilitating their isolation.
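The isolation effect described can be demonstrated on synthetic data: with a few extreme points, one k-means centroid migrates to them and the 'outlier' cluster stays tiny. A toy sketch, not the actual firm data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(101)
# 100 ordinary 'firms' near the origin plus 3 extreme ones far away.
normal = rng.normal(0.0, 0.05, size=(100, 2))
outliers = np.array([[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
X = np.vstack([normal, outliers])

km = KMeans(n_clusters=2, n_init=10, random_state=101).fit(X)
sizes = np.bincount(km.labels_)
print(sorted(sizes.tolist()))  # the extremes sit alone in a 3-point cluster
```

The minority cluster is then exactly the set of points to hand to supervisors, which is the behaviour exploited above.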
Removing derived columns and highly correlated columns from the data
df3_long_2020_db = df3_long_2020.copy()
Using MinMaxScaler()
scaler = preprocessing.MinMaxScaler()
df3_long_2020_scaled_db = scaler.fit_transform(df3_long_2020_db)
df3_long_2020_scaled_db.shape
df3_long_2020_scaled_db = pd.DataFrame(df3_long_2020_scaled_db, columns=df3_long_2020_db.columns, index = df3_long_2020_db.index)
df3_long_2020_scaled_db.head(2)
df3_long_2020_scaled_db = df3_long_2020_scaled_db.drop(['Grosscombinedratio','Grossexpenseratio','Netexpenseratio','Purenetclaimsratio',
'Excessofassetsoverliabilities(£m)[=equity]','NWP_abs_change_YoY',
'NWP_perc_change_YoY', 'SCRCovRatio_abs_change_YoY','SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY',
'Grossclaimsincur_perc_change_YoY', 'Netcombratio_abs_change_YoY',
'Netcombratio_perc_change_YoY', 'NWP_abs_change_YoY_z',
'Grossclaimsincur_abs_change_YoY_z','Grossclaimsincur_perc_change_YoY_z',
'Profitable_Binary'],axis=1)
df3_long_2020_scaled_db.shape
For DBSCAN, the choice of minPts and eps (ε) is subjective; however, the following heuristics are commonly advised:
- minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense. For a multidimensional dataset, minPts should be 2 × the number of dimensions. Here, minPts = 2 × 13 = 26.
- eps (ε): a distance measure that will be used to locate the points in the neighbourhood of any point.
Choosing the optimal value of eps (using k-NN):
from sklearn.neighbors import NearestNeighbors
# n_neighbors = 25 (minPts - 1), since kneighbors() counts each point's zero distance to itself as its first neighbour
nbrs = NearestNeighbors(n_neighbors = 25).fit(df3_long_2020_scaled_db)
# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(df3_long_2020_scaled_db)
# sort the neighbor distances (lengths to points) in ascending order
sort_neigh_dist = np.sort(neigh_dist, axis = 0)
k_dist = sort_neigh_dist[:, 24]
plt.plot(k_dist)
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations")
plt.show()
kneedle = KneeLocator(x = range(1, len(neigh_dist)+1), y = k_dist, S = 1.0,
curve = "concave", direction = "increasing", online=True)
print(kneedle.knee_y)
kneedle.plot_knee()
clustering = DBSCAN(eps=0.3, min_samples=26).fit(df3_long_2020_scaled_db)
DBSCAN_dataset = df3_long_2020_scaled_db.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_
DBSCAN_dataset.Cluster.value_counts().to_frame()
# Similar outlying firms as obtained by the K-means algorithm
DBSCAN_dataset[DBSCAN_dataset['Cluster']==-1]
clus_dbscan_firms = (DBSCAN_dataset[DBSCAN_dataset['Cluster']==-1]).index
clus_dbscan_firms
The results from density-based clustering are two clusters, with one containing the outlying firms (27 firms out of the 325 firms input into the model).
len(clus_dbscan_firms)
print(list(np.intersect1d(clus_kmeans_firms, clus_dbscan_firms)) )
We validate this by taking the intersection of the anomalous firms highlighted by k-means and DBSCAN: there is an overlap of 20 firms between the two algorithms.
len(list(np.intersect1d(clus_kmeans_firms, clus_dbscan_firms)) )
df3_long_2020_hc = df3_long_2020.copy()
Using MinMaxScaler()
scaler = preprocessing.MinMaxScaler()
df3_long_2020_scaled_hc = scaler.fit_transform(df3_long_2020_hc)
df3_long_2020_scaled_hc.shape
df3_long_2020_scaled_hc = pd.DataFrame(df3_long_2020_scaled_hc, columns=df3_long_2020_hc.columns, index = df3_long_2020_hc.index)
df3_long_2020_scaled_hc.head(2)
df3_long_2020_scaled_hc = df3_long_2020_scaled_hc.drop(['Grosscombinedratio','Grossexpenseratio','Netexpenseratio','Purenetclaimsratio',
'Excessofassetsoverliabilities(£m)[=equity]','NWP_abs_change_YoY',
'NWP_perc_change_YoY', 'SCRCovRatio_abs_change_YoY','SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY',
'Grossclaimsincur_perc_change_YoY', 'Netcombratio_abs_change_YoY',
'Netcombratio_perc_change_YoY', 'NWP_abs_change_YoY_z',
'Grossclaimsincur_abs_change_YoY_z','Grossclaimsincur_perc_change_YoY_z',
'Profitable_Binary'],axis=1)
plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(df3_long_2020_scaled_hc, method='ward'))
plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(df3_long_2020_scaled_hc, method='ward'))
plt.axhline(y=2.6, color='r', linestyle='--')
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
cluster.fit_predict(df3_long_2020_scaled_hc)
df3_long_2020_scaled_hc['Clusters'] = cluster.labels_
df3_long_2020_scaled_hc['Clusters'].value_counts()
df3_long_2020_scaled_hc[(df3_long_2020_scaled_hc['Clusters']==2) | (df3_long_2020_scaled_hc['Clusters']==0)]
clus_hier_firms = (df3_long_2020_scaled_hc[(df3_long_2020_scaled_hc['Clusters']==2) | (df3_long_2020_scaled_hc['Clusters']==0)]).index
clus_hier_firms
print(list(np.intersect1d(clus_kmeans_firms, clus_hier_firms)) )
The firms highlighted through K-means are a perfect subset of those highlighted through hierarchical clustering.
len(clus_kmeans_firms)
len(clus_hier_firms)
len(clus_dbscan_firms)
len(set(clus_kmeans_firms) & set(clus_hier_firms) & set(clus_dbscan_firms))
print(set(clus_kmeans_firms) & set(clus_hier_firms) & set(clus_dbscan_firms))
Out of the 325 firms in our dataset, 24 firms are flagged by k-means, 27 by DBSCAN and 32 by agglomerative clustering. Of these, 20 firms are in common, flagged by all three algorithms. Thus, all three algorithms flag a similar set of firms as being in an anomalous cluster. As described above, for a conventional use case, DBSCAN is expected to work best with data containing noise and outliers. However, our use case is unusual in that it requires outliers to be treated differently from the ‘normal’ data points, and hence we get good performance with k-means as well. Additionally, this dataset is partly anonymised using a random multiplier, which may distort certain trends. It is important to note that parameters like 'k' and ‘eps’ may need to be tweaked depending on the number of firms that can be supervised with the scarce resources available.
Isolation forest can be employed here to identify firms that are significantly different from the others in the population (i.e. outliers). In this model, randomly sub-sampled data is processed in a tree structure based on randomly selected features. Samples that travel deeper into the tree are less likely to be outliers, as more cuts were needed to isolate them; samples that end up in shorter branches are candidate anomalies, as it was easier for the tree to separate them from the other observations. I am trying this approach on the data for a single year, i.e. 2020.
df3_long_2020_if = df3_long_2020.copy()
Using MinMaxScaler()
scaler = preprocessing.MinMaxScaler()
df3_long_2020_scaled_if = scaler.fit_transform(df3_long_2020_if)
df3_long_2020_scaled_if.shape
df3_long_2020_scaled_if = pd.DataFrame(df3_long_2020_scaled_if, columns=df3_long_2020_if.columns, index = df3_long_2020_if.index)
df3_long_2020_scaled_if.head(2)
df3_long_2020_scaled_if = df3_long_2020_scaled_if.drop(['Grosscombinedratio','Grossexpenseratio','Netexpenseratio','Purenetclaimsratio',
'Excessofassetsoverliabilities(£m)[=equity]','NWP_abs_change_YoY',
'NWP_perc_change_YoY', 'SCRCovRatio_abs_change_YoY','SCRCovRatio_perc_change_YoY',
'Grossclaimsincur_abs_change_YoY',
'Grossclaimsincur_perc_change_YoY', 'Netcombratio_abs_change_YoY',
'Netcombratio_perc_change_YoY', 'NWP_abs_change_YoY_z',
'Grossclaimsincur_abs_change_YoY_z','Grossclaimsincur_perc_change_YoY_z',
'Profitable_Binary'],axis=1)
df3_long_2020_scaled_if.shape
df3_long_2020_scaled_if.columns
model = IsolationForest(n_estimators=100, max_samples=300, contamination=0.10, max_features=8, random_state=101)
model.fit(df3_long_2020_scaled_if)
df3_long_2020_scaled_if['scores']=model.decision_function(df3_long_2020_scaled_if)
df3_long_2020_scaled_if['anomaly']=model.predict(df3_long_2020_scaled_if.drop('scores',axis=1))
df3_long_2020_scaled_if['anomaly'].value_counts()
df3_long_2020_scaled_if[df3_long_2020_scaled_if['anomaly']==-1]
if_firms = (df3_long_2020_scaled_if[df3_long_2020_scaled_if['anomaly']==-1]).index
len(if_firms)
if_firms
print(set(if_firms) & set(clus_hier_firms))
len(set(if_firms) & set(clus_hier_firms))
len(set(if_firms) & set(clus_kmeans_firms))
len(set(if_firms) & set(clus_dbscan_firms))
print(set(if_firms) & set(clus_kmeans_firms) & set(clus_dbscan_firms) & set(clus_hier_firms))
len(set(if_firms) & set(clus_kmeans_firms) & set(clus_dbscan_firms) & set(clus_hier_firms))
print(set(if_firms) | set(clus_kmeans_firms) | set(clus_dbscan_firms) | set(clus_hier_firms))
len(set(if_firms) | set(clus_kmeans_firms) | set(clus_dbscan_firms) | set(clus_hier_firms))
I have conducted an Exploratory Data Analysis (EDA) through various types of data visualisations. My aim through this section was to get a better understanding of the data and create visualisations that focus on the priorities of supervisory resource allocation. I wanted to come up with certain ‘gating rules’ to weed out or identify the important or obviously ‘risky’ firms from the population. These kinds of rules can be used by supervisors while examining dashboards/reports to identify the obvious ‘target’ firms, before delving deeper into the results provided by more sophisticated machine learning algorithms on the data. Given that we primarily have numeric data for the 325 firms, I started with a pairplot, which helps us visualise pairwise relationships between variables in our dataset. The pairplot indicates that there are several variables that are highly correlated with each other. Additionally, several outliers are clearly visible in the data, indicated by isolated data points on the majority of plots. However, due to the large number of plots, it is difficult to draw further conclusions from the pairplot, and hence I have analysed separate bivariate scatterplots in later sections of this report. I have then visualised the strength of the relationships between the variables (correlations) using a heatmap. The heatmap depicts direct (positive) correlations between certain variables such as Totalassets(£m) with Totalliabilities(£m), GWP(£m) with NWP(£m), Grossclaimsincurred(£m) with GrossBEL, EoFforSCR(£m) with Totalassets(£m), Totalliabilities(£m), SCR(£m), NWP(£m), GWP(£m), Excessofassetsoverliabilities(£m)(=equity), to name a few. We also see that the Netexpenseratio, Grosscombinedratio, Grossexpenseratio, Puregrossclaimsratio, Purenetclaimsratio are perfectly correlated with each other.
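As a sketch of the correlation screen described above, the highly correlated pairs can be pulled out programmatically. The frame below is synthetic with simplified column names, for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the firm data; column names are illustrative only
rng = np.random.default_rng(0)
assets = rng.lognormal(mean=3, sigma=1, size=100)
df = pd.DataFrame({
    "Totalassets": assets,
    "Totalliabilities": assets * 0.9 + rng.normal(0, 0.1, size=100),
    "Netcombinedratio": rng.normal(1.0, 0.2, size=100),
})

# Flag variable pairs whose absolute correlation exceeds 0.9,
# mirroring the strong relationships visible on the heatmap
corr = df.corr()
high = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high)
```

On the real dataset, such a screen can be used to shortlist candidates for dropping before clustering.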
A deeper view into this is achieved through separate bivariate plots. Keeping in mind our first objective of giving more attention to the biggest firms, I have examined indicators of a firm’s size such as its Gross Written Premium, Net Written Premium, Total Assets, Total Liabilities, etc. The correlations from the heatmap support the usual idea that these variables are related to each other. Using plotly and cufflinks, I have created an interactive scatter plot of the total assets (£m) vs. total liabilities (£m) and concluded that firms with total assets or liabilities of more than £100,000m should be investigated. Based on the availability of resources, this threshold can be tweaked to supervise a larger or smaller number of firms beyond it. Based on this cutoff, supervisors should pay keen attention to 16 firms: Firm 10, Firm 101, Firm 105, Firm 151, Firm 199, Firm 210, Firm 216, Firm 247, Firm 26, Firm 298, Firm 311, Firm 34, Firm 4, Firm 6, Firm 7 and Firm 73. Another indicator of the size of the firm is its turnover, i.e. the Gross Written Premium and, derived from it, the Net Written Premium (in £m). From the interactive plot of GWP (£m) vs. NWP (£m), we see that supervisors should focus on firms with GWP of more than £10,000m or NWP of more than £5,000m. The 25 firms that satisfy these criteria are Firm 10, Firm 101, Firm 104, Firm 105, Firm 120, Firm 131, Firm 151, Firm 17, Firm 199, Firm 210, Firm 23, Firm 234, Firm 247, Firm 25, Firm 26, Firm 276, Firm 30, Firm 311, Firm 34, Firm 38, Firm 4, Firm 51, Firm 6, Firm 7, Firm 73. Between the 16 firms flagged using total assets/total liabilities and the 25 firms flagged using GWP/NWP, there is a large overlap of 14 firms flagged by both criteria. A simpler alternative is to identify the top 'X' (10/20/30 and so on) firms based on their size for investigation, where 'X' can be adjusted based on resource availability.
In 2020, the top 10 firms based on turnover or total revenue written by the insurer (GWP(£m)) were Firm 210, Firm 4, Firm 311, Firm 34, Firm 7, Firm 151, Firm 26, Firm 247, Firm 25 and Firm 73.
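These gating rules can be sketched on a hypothetical mini-frame (firm names and values here are illustrative, not the real data; the real analysis uses the full 2020 dataset and plotly for interactivity):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the 2020 data; values are made up
firms = pd.DataFrame({
    "Firm": ["Firm 4", "Firm 7", "Firm 29", "Firm 88"],
    "Totalassets(£m)": [250000, 180000, 4000, 900],
    "Totalliabilities(£m)": [240000, 150000, 3500, 700],
    "GWP(£m)": [30000, 12000, 800, 9000],
    "NWP(£m)": [20000, 9000, 600, 4000],
}).set_index("Firm")

# Gating rule 1: total assets or liabilities above £100,000m
size_flag = firms[(firms["Totalassets(£m)"] > 100000) |
                  (firms["Totalliabilities(£m)"] > 100000)].index

# Gating rule 2: GWP above £10,000m or NWP above £5,000m
premium_flag = firms[(firms["GWP(£m)"] > 10000) |
                     (firms["NWP(£m)"] > 5000)].index

# Firms flagged by both criteria, plus a simple top-X shortlist by turnover
overlap = size_flag.intersection(premium_flag)
top_2_by_gwp = firms["GWP(£m)"].nlargest(2).index
print(list(overlap), list(top_2_by_gwp))
```

The thresholds and 'X' are the tunable levers for matching the shortlist to supervisory capacity.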
In order to derive further insights from the data, I have engineered several features such as the size of buffer (the surplus over 100% of the SCR coverage ratio), whether or not the firm is profitable based on its net combined ratio, and the absolute and percentage change year-on-year for several key metrics such as Net Written Premium, Gross claims incurred, etc. On analysing a plot of the size of buffer, I see clear anomalies which look more like cases of erroneous reporting of the SCR coverage ratio; this is explored in a later section of the report. To evaluate our second priority on the changing business profiles of firms year-on-year, I wanted to identify firms whose data are changing substantially, leveraging the derived year-on-year change variables. I have cherry-picked variables based on business knowledge, favouring those that are correlated with several others in the dataset, such that they capture the main trends in the data. I have constructed a boxplot for each of them to see their distributions and the presence of outliers. The boxplot of the percentage change year-on-year in the SCR coverage ratio is another indicator that supports our hypothesis of erroneous values in the SCR coverage ratio.
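The feature engineering steps above can be sketched as follows (a hypothetical one-firm slice whose NWP values echo the Firm 104 example discussed later; the real data holds 325 firms over 2016-2020):

```python
import pandas as pd

# Illustrative long-format slice; column names follow the report's naming style
df = pd.DataFrame({
    "Firm": ["Firm 104"] * 4,
    "Year": [2017, 2018, 2019, 2020],
    "NWP(£m)": [63.76, 18870.46, 1697.62, 1500.00],
    "SCRcoverageratio": [1.20, 1.30, 1.25, 1.40],
}).sort_values(["Firm", "Year"]).reset_index(drop=True)

# Absolute and percentage change year-on-year, computed within each firm
df["NWP_abs_change_YoY"] = df.groupby("Firm")["NWP(£m)"].diff()
df["NWP_perc_change_YoY"] = df.groupby("Firm")["NWP(£m)"].pct_change() * 100

# Size of buffer: surplus over 100% of the SCR coverage ratio
df["Size_of_Buffer"] = df["SCRcoverageratio"] - 1.0
print(df[["Year", "NWP_abs_change_YoY", "Size_of_Buffer"]])
```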
I have primarily used two methods to identify firms that are dynamic, i.e. whose data are changing significantly: the z-score and the interquartile range. Through the z-score approach, there are 13 firms (Firm 1, Firm 104, Firm 105, Firm 131, Firm 151, Firm 199, Firm 210, Firm 247, Firm 26, Firm 276, Firm 311, Firm 4, Firm 7) flagged as having an absolute change in their Net Written Premium (year-on-year) greater than 2.5 standard deviations from the mean. I then validated the inferences from the z-score analysis by doing a sample study for the flagged Firm 104. It has an unusual spike in NWP(£m) from 63.76 in 2017 to 18,870.46 in 2018 and then a fall back to 1,697.62 in 2019, indicating drastic (not gradual) year-on-year changes for this firm, which must be investigated. I have used a similar z-score analysis to identify 18 firms (Firm 105, Firm 112, Firm 158, Firm 17, Firm 200, Firm 216, Firm 22, Firm 234, Firm 25, Firm 261, Firm 275, Firm 283, Firm 286, Firm 37, Firm 49, Firm 52, Firm 74, Firm 96) changing by large amounts based on their gross claims incurred year-on-year. An insight from the raw data study for the highlighted Firm 216 is an unusual dip in Gross claims incurred (£m) from 6,844.01 in 2018 to 0.00 in 2019 and 2020. The same analysis can be done on several other key indicator variables to see which firms are changing year-on-year. An interesting aspect to highlight here is that, in a similar z-score analysis on the percentage change in gross claims incurred, a large percentage change may still correspond to a small absolute change in value (for example, from 0.0 in 2019 to 21.45 in 2020 for Firm 29). This may not warrant much supervisory attention because of the small volume. Therefore, we need to take the volume into account as well while looking at the percentage, i.e. we must look at variables in conjunction with each other.
Therefore, the z-score analysis with a threshold of 2.5 is a good indicator of drastically changing firms. The interquartile range, by contrast, flagged a large proportion (~50%) of firms as requiring attention based on the absolute change in NWP, and hence the z-score analysis is the more suitable approach for our business case. Following the firm-wise view, I looked at an aggregate-level chart for all firms to see their changes in gross claims incurred over time, and derived another insight: gross claims incurred are reducing from 2018 to 2020. I validated this conclusion using measures of central tendency such as the mean and median.
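A minimal sketch of the two rules on synthetic data with one injected spike (the thresholds follow the report: 2.5 standard deviations for the z-score, and 1.5 × IQR for the interquartile-range rule):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a year-on-year change column, with one injected spike
rng = np.random.default_rng(1)
change = pd.Series(list(rng.normal(0, 50, 99)) + [5000.0],
                   index=[f"Firm {i}" for i in range(1, 101)])

# z-score rule: flag changes more than 2.5 standard deviations from the mean
z = (change - change.mean()) / change.std()
z_flagged = z[z.abs() > 2.5].index

# IQR rule for comparison: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = change.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flagged = change[(change < q1 - 1.5 * iqr) | (change > q3 + 1.5 * iqr)].index
print(len(z_flagged), len(iqr_flagged))
```

On heavy-tailed change data, the IQR rule typically flags many more firms than the z-score rule, which matches the ~50% flag rate observed on the real NWP changes.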
The third priority for supervisory resource allocation was to identify outliers from the norm, i.e. for a single reporting period, the firms that deviate significantly from the average. As the data visualisations created above indicated the presence of pronounced outliers in our dataset, I decided to further validate this using machine learning techniques. I have also created bivariate scatter plots for a single reporting period to validate the conclusions drawn from our graphical analyses of total assets (£m) vs. total liabilities (£m) and GWP(£m) vs. NWP(£m), and similar trends are visible here. For machine learning, I have chosen the year 2020 as the period for analysis because, through prior discussions and general understanding, I believe that supervisors may tend to focus their attention on the most recent data for firms, and our dataset has 2020 as the latest year. However, the same methods can be replicated on the previous years' data as well. Before we perform machine learning techniques on the data, we need to perform certain preprocessing steps, as the variables have different scales, with certain features being in millions while others are ratios. This can be achieved through scaling. I have tested the RobustScaler(), StandardScaler() and MinMaxScaler() for this, and proceeded with the MinMaxScaler() as it had a superior performance on our data. The underlying reason for this could be that the robust scaler uses the interquartile range so that it is robust to outliers, and hence the scaled data for outliers and the ‘normal’ data points are not very different, thereby making it harder for the clustering algorithm to separate them into a different cluster. However, the MinMaxScaler() uses the maximum and minimum values for a variable, so the outliers would correspond to the maximum and minimum scaled values, while the ‘normal’ data points will be squished into a narrower range, making it easier for the clustering algorithm to differentiate between them.
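The mechanics of the two scalers can be illustrated on a single feature with one extreme value (synthetic numbers; whether a given scaling helps or hinders cluster separation depends on the full distance structure of the data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# A single feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# MinMaxScaler maps min -> 0 and max -> 1: the outlier is pinned at 1
# and the 'normal' points are squeezed into a narrow band near 0
mm = MinMaxScaler().fit_transform(x)

# RobustScaler subtracts the median and divides by the IQR, so the 'normal'
# points keep a comparable spread regardless of how extreme the outlier is
rb = RobustScaler().fit_transform(x)
print(mm.ravel(), rb.ravel())
```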
Our data is unlabelled and hence requires unsupervised learning. I decided to apply two machine learning techniques to identify the anomalies: clustering and isolation forest. With respect to clustering, I wanted to analyse whether the algorithm is able to weed out anomalous observations into a few smaller clusters; these are the firms that may need to be investigated by supervisors. Even though clustering does not rely on linear assumptions, and collinearity would thus not cause major issues, it can still lead to redundant features, larger computational requirements and the use of more data than needed to reach the same patterns. Additionally, as the data used for supervision at the PRA can be large, we need to be conscious of ensuring an optimal use of computational resources. Hence, based on business knowledge and high correlations, I dropped a few variables from the dataset. The final variables retained in the dataset for machine learning were GWP(£m), NWP(£m), SCR(£m), SCRcoverageratio, Totalassets(£m), Totalliabilities(£m), GrossBEL(inc.TPsaswhole,pre-TMTP)(£m), Grossclaimsincurred(£m), NetBEL(inc.TPsasawhole,pre-TMTP)(£m), Netcombinedratio, Puregrossclaimsratio, EoFforSCR(£m) and Size_of_Buffer.
The first approach I explored is k-means clustering, a method of partitional clustering. It is fast, computationally efficient and provides a good initial view of the clusters in our data. An important decision in k-means is the subjective choice of the number of clusters (k) beforehand, which I made using the elbow plot and silhouette coefficient. I tried clustering with k as 3, 4 and 6. In my opinion, the best choice of k for this dataset is 3, as the inertia is low and the number of iterations to converge is half of that at 6 clusters. Additionally, the outlying firms sit in two distinct clusters out of the 3 (as seen through the firm counts and in the plots). Since our data is multivariate, i.e. it has many dimensions, the best way to plot the clusters is to reduce it to two dimensions using Principal Component Analysis (PCA) and then plot it. The k-means algorithm flagged 24 firms (out of the 325 input firms) as being grouped in the two ‘anomalous’ clusters.
| k | Inertia | No. of Iterations to Converge |
|---|---|---|
| 3 | 19.3 | 6 |
| 4 | 22.5 | 5 |
| 6 | 12.3 | 12 |
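The k-means step can be sketched on synthetic scaled data (a dense ‘normal’ mass plus a small outlying group standing in for the real 13-feature 2020 dataset; the cluster counts here are illustrative, not the report's results):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic scaled data: 300 'normal' firms plus 10 outlying firms
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.3, 0.05, size=(300, 13)),   # normal firms
    rng.normal(0.9, 0.05, size=(10, 13)),    # outlying firms
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
sizes = np.bincount(km.labels_)

# The outlying firms land in a small cluster; flag its members
small = sizes.argmin()
flagged = np.where(km.labels_ == small)[0]

# Reduce to 2 principal components to plot the clusters, as in the report
coords = PCA(n_components=2).fit_transform(X)
print(sizes, flagged, coords.shape)
```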
I then proceeded to test density-based and hierarchical clustering methods. For density-based clustering, I used DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN tends to be robust to the presence of noise and outliers in the data, while k-means clustering can be sensitive to outliers. A possible reason why k-means performs well on our data is that our aim is to give outliers differential treatment (essentially, to identify them) rather than absorb them among the ‘normal’ data points, as most conventional uses of k-means on a dataset containing outliers would require. In other words, since the cluster centroids in k-means are means, the clusters in which the 'outlier' firms sit contain very few observations, thereby facilitating their isolation. For DBSCAN, the choice of minPts is subjective but is advised to be twice the number of dimensions, and hence I chose a value of 26 (our data has 13 features). I leveraged the distances from the k-nearest neighbours algorithm to choose an optimal value of eps as 0.3. The results from density-based clustering were two clusters, with one containing the outlying firms (27 of the 325 firms input into the model). I validated this by taking the intersection of the anomalous firms highlighted by k-means and DBSCAN, and saw an overlap of 20 firms between the two algorithms.
For agglomerative hierarchical clustering, I created a dendrogram to choose the number of clusters, which I set at 3. The hierarchical clustering algorithm flags 32 firms in the two ‘anomalous’ clusters, which I would recommend supervisors pay attention to. Additionally, the firms highlighted through k-means clustering are a perfect subset of those highlighted through hierarchical clustering. Hence, out of the 325 firms in our dataset, 24 firms are flagged by k-means, 27 by DBSCAN and 32 by agglomerative clustering. Of these, 20 firms are in common, flagged by all three algorithms. Thus, all three algorithms flag a similar set of firms as being in isolated clusters. As described above, for a conventional use case, DBSCAN is expected to work best with data containing noise and outliers. However, our use case is unusual in that it requires outliers to be treated differently from the ‘normal’ data points, and hence we get good performance with k-means as well. Additionally, this dataset is partly anonymised using a random multiplier, which may distort certain trends. It is important to note that parameters like 'k' and ‘eps’ may need to be tweaked depending on the number of firms that can be supervised with the scarce resources available.
In my next approach, I leveraged the isolation forest algorithm for anomaly detection. In this model, samples that travel deeper into the tree are less likely to be outliers, as they need more cuts to isolate them; samples that end up in shorter branches are candidate anomalies, as it was easier for the tree to separate them from the other observations. I have used the same dataset for the year 2020 (with 325 firms and 13 features) to fit this model. I have assumed a contamination of 10% in the data; this parameter is important for the model and can be tweaked according to the number of firms there is capacity to supervise. The model flags 33 firms as anomalies. Of these 33 firms, 22 overlap with the 24 flagged by the k-means clustering algorithm, 25 with the 27 flagged by the DBSCAN algorithm, and 25 with the 32 flagged by agglomerative clustering.
In conclusion, the clustering and isolation forest machine learning techniques perform well in identifying a common pool of 19 firms to bring to the supervisors’ attention. The overlap in results from various algorithms indicates that they are robust and can be used independently as well, depending on the availability of time and resources. If a union of the firms highlighted by the three clustering methods and the isolation forest algorithm is considered, the set consists of 41 firms to investigate, which is a significant reduction in resource consumption compared to that required for the original set of 325 firms. The table below provides a summary of the results from each technique. Such a view can be generated for supervisors after a consensus on the methods to be implemented is reached (as there is a large overlap of results indicating robustness of each technique independently).
| Technique | Number of Firms Flagged for Investigation | Firms Flagged for Investigation |
|---|---|---|
| Graphical Analysis: Total assets (£m) vs. Total liabilities (£m) | 16 | Firm 10,Firm 101,Firm 105,Firm 151,Firm 199,Firm 210,Firm 216,Firm 247,Firm 26,Firm 298,Firm 311,Firm 34,Firm 4,Firm 6,Firm 7,Firm 73 |
| Graphical Analysis: NWP(£m) vs. GWP(£m) | 25 | Firm 10, Firm 101, Firm 104, Firm 105, Firm 120, Firm 131, Firm 151, Firm 17, Firm 199, Firm 210, Firm 23, Firm 234, Firm 247, Firm 25, Firm 26, Firm 276, Firm 30, Firm 311, Firm 34, Firm 38, Firm 4, Firm 51, Firm 6, Firm 7, Firm 73 |
| z-score: Absolute change in NWP (£m) | 13 | Firm 1, Firm 104, Firm 105, Firm 131, Firm 151,Firm 199, Firm 210, Firm 247, Firm 26, Firm 276,Firm 311, Firm 4, Firm 7 |
| z-score: Absolute change in Gross Claims Incurred (£m) | 18 | Firm 105, Firm 112, Firm 158, Firm 17, Firm 200,Firm 216, Firm 22, Firm 234, Firm 25, Firm 261,Firm 275, Firm 283, Firm 286, Firm 37, Firm 49,Firm 52, Firm 74, Firm 96 |
| Clustering: Partitional (K-means) | 24 | Firm 10, Firm 105, Firm 112, Firm 151, Firm 158, Firm 17,Firm 199, Firm 210, Firm 22, Firm 247, Firm 26, Firm 261,Firm 280, Firm 283, Firm 286, Firm 298, Firm 304, Firm 311,Firm 34, Firm 4, Firm 52, Firm 6, Firm 7, Firm 73 |
| Clustering: Density-Based (DBSCAN) | 27 | Firm 10, Firm 105, Firm 112, Firm 127, Firm 151, Firm 158,Firm 166, Firm 17, Firm 199, Firm 210, Firm 22, Firm 228,Firm 26, Firm 270, Firm 283, Firm 284, Firm 286, Firm 304,Firm 311, Firm 34, Firm 4, Firm 52, Firm 6, Firm 7,Firm 72, Firm 73, Firm 99 |
| Clustering: Hierarchical (Agglomerative) | 32 | Firm 10, Firm 105, Firm 112, Firm 151, Firm 158, Firm 17,Firm 188, Firm 199, Firm 208, Firm 210, Firm 22, Firm 247,Firm 26, Firm 261, Firm 276, Firm 280, Firm 283, Firm 286,Firm 295, Firm 298, Firm 30, Firm 304, Firm 311, Firm 34,Firm 4, Firm 51, Firm 52, Firm 6, Firm 7, Firm 73,Firm 74, Firm 81 |
| Isolation Forest | 33 | Firm 10, Firm 105, Firm 112, Firm 127, Firm 151, Firm 158,Firm 166, Firm 17, Firm 188, Firm 199, Firm 210, Firm 22,Firm 228, Firm 234, Firm 247, Firm 25, Firm 26, Firm 280,Firm 283, Firm 284, Firm 295, Firm 298, Firm 30, Firm 304,Firm 311, Firm 34, Firm 4, Firm 52, Firm 6, Firm 7,Firm 72, Firm 73, Firm 99 |
| Common firms flagged by all clustering techniques and isolation forest | 19 | Firm 10, Firm 6, Firm 105, Firm 151, Firm 199, Firm 4, Firm 158, Firm 112, Firm 73, Firm 34, Firm 22, Firm 52, Firm 311, Firm 26, Firm 210, Firm 17, Firm 304, Firm 7, Firm 283 |
| Firms flagged by at least one clustering technique or isolation forest | 41 | Firm 228, Firm 30, Firm 280, Firm 22, Firm 166, Firm 295, Firm 6, Firm 112, Firm 158, Firm 286, Firm 26, Firm 188, Firm 7, Firm 81, Firm 105, Firm 234, Firm 270, Firm 73, Firm 261, Firm 52, Firm 208, Firm 74, Firm 298, Firm 51, Firm 304, Firm 72, Firm 25, Firm 10, Firm 247, Firm 151, Firm 199, Firm 4, Firm 276, Firm 127, Firm 34, Firm 284, Firm 311, Firm 210, Firm 17, Firm 99, Firm 283 |
Please note: the machine learning algorithms are run on the subset of data for the year 2020, while the graphical and z-score analyses use the entire dataset from 2016 to 2020.
From the above analyses, we have seen certain variables that have extremely small or large values, indicating that they may not be outliers, but errors in reporting instead. Errors in reporting could include denomination issues (e.g. misinterpretation of the expected denomination of the reporting entry, such as reporting in thousands rather than in millions), or reporting in the incorrect box or even a misinterpretation of the meaning of a reporting template box. Such errors make the data implausible and we need to query the same with the reporters.
df3_long_err = df3_long.copy()
df3_long_err['SCRcoverageratio'].iplot(kind='box')
df3_long_err[df3_long_err['SCRcoverageratio']>99930290][['Firm','Year','SCRcoverageratio']]
df3_long_err['SCRcoverageratio'].describe()
For a variable that has a median of 1.5875 and a Q3 of 3.0612 (not looking at the mean here, as it is highly affected by the erroneous values), a value of 9.993029e+08 or 9.269834e+08 is definitely erroneous. In such a case, we definitely need to go back to the firms (Firm 131 and Firm 216) and highlight this. The interactive graph below, drawn for each year, shows whether this has occurred in multiple years or for multiple firms.
fig = px.line(df3_long_err,x='Firm',y='SCRcoverageratio',color='Year')
fig.show()
- Firm 320 in the year 2016
- Firms 1, 131 and 216 in the year 2017
- Firm 66 in the year 2018
Doing a similar graphical analysis for gross combined ratio:
df3_long_err['Grosscombinedratio'].iplot(kind='box')
df3_long_err[df3_long_err['Grosscombinedratio']>3500000][['Firm','Year','Grosscombinedratio']]
df3_long_err['Grosscombinedratio'].describe()
For a variable that has a median of 1.395327e-01 and a Q3 of 1.000998e+00 (not looking at the mean here, as it is highly affected by the erroneous value), a value of 3.978266e+06 is definitely erroneous. In such a case, we definitely need to go back to the firm (Firm 188) and highlight this. The interactive graph below, drawn for each year, shows whether this has occurred in multiple years or for multiple firms.
fig = px.line(df3_long_err,x='Firm',y='Grosscombinedratio',color='Year')
fig.show()
- Firm 188 in the year 2019
- Firm 28 in the year 2018
I have carried out a graphical analysis followed by raw data validation for this. As seen in the plots of the size of buffer and the SCR coverage ratio in the section above, the box plot of the SCR coverage ratio shows an extreme value of 0.9993bn. For a ratio variable with a median of 1.5875, a Q1 of 0.2913 and a Q3 of 3.0612, a value close to 1bn is clearly erroneous (the mean is not considered, as it is heavily skewed by this value). The raw data shows such values for Firms 131 and 216 in the year 2017, both of which were flagged in the graphical analyses and z-score checks above. The ML techniques were applied to 2020 data, so this data point is absent there; otherwise it would likely have been flagged by the clustering and isolation forest algorithms as well. A firm-wise interactive line graph of the SCR coverage ratio for each year suggests that Firm 320 in 2016, Firms 1, 131 and 216 in 2017 and Firm 66 in 2018 could be cases of erroneous reporting rather than genuine outliers.
A similar graphical analysis of the Gross combined ratio shows a value of 3.98mn for Firm 188 in 2019 and a value of 16,023.6 for Firm 28 in 2018. For a variable with a median of 0.1395, a Q1 of 0 and a Q3 of 1, these are almost certainly erroneous values rather than outliers. Based on this analysis, I would recommend that supervisors pay careful attention to these firms and, where appropriate, request them to correct and resubmit their data.
The Underwriting dataset contains certain firms that are not present in the general dataset, so the metrics available for these firms are limited. However, I have tried to prioritise the firms to be investigated using the methods shown below:
Again, filtering the data to retain information for the year 2020 only.
df4_long_2020 = df4_long[df4_long['Year']==2020]
df4_long_2020.columns
df4_long_2020.shape
fig = px.line(df4_long_2020,x='Firm',y=['GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)',
'Grossclaimsincurred(£m)', 'Grosscombinedratio', 'Grossexpenseratio',
'NetBEL(inc.TPsasawhole,pre-TMTP)(£m)', 'Netcombinedratio',
'Netexpenseratio', 'Puregrossclaimsratio', 'Purenetclaimsratio'])
fig.show()
From the above graph, we see that Firm 449 is a clear outlier and a possible case of erroneous reporting for the variable Grosscombinedratio. Similarly, Firm 418 appears to be an outlier for Purenetclaimsratio and Netcombinedratio. Thus, Firms 418 and 449 should be investigated.
Bivariate Plot of Grosscombinedratio vs. Netcombinedratio
df4_long_2020.iplot(kind='scatter', x='Grosscombinedratio', y='Netcombinedratio', mode='markers', color='blue',
                    title='Gross combined ratio vs. Net combined ratio for the Year 2020',
                    xTitle='Gross Combined Ratio', yTitle='Net Combined Ratio')
df4_long_2020[(df4_long_2020['Grosscombinedratio']< -2) | (df4_long_2020['Netcombinedratio']<-1000) | (df4_long_2020['Netcombinedratio']>59)]
Bivariate Plot of GrossBEL vs. NetBEL
df4_long_2020.iplot(kind='scatter', x='GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)', y='NetBEL(inc.TPsasawhole,pre-TMTP)(£m)',
                    mode='markers', color='red',
                    title='Gross Best Estimate Liability vs. Net Best Estimate Liability for the Year 2020',
                    xTitle='Gross BEL', yTitle='Net BEL')
A bivariate scatter plot of the Gross Best Estimate Liability vs. the Net Best Estimate Liability for the year 2020 shows that firms whose GrossBEL(inc.TPsaswhole,pre-TMTP)(£m) or NetBEL(inc.TPsasawhole,pre-TMTP)(£m) exceeds 1000 should be focused on. The 15 firms flagged here are Firm 329, Firm 337, Firm 347, Firm 349, Firm 351, Firm 359, Firm 362, Firm 387, Firm 393, Firm 397, Firm 403, Firm 427, Firm 428, Firm 429 and Firm 451.
(df4_long_2020[(df4_long_2020['GrossBEL(inc.TPsaswhole,pre-TMTP)(£m)']>1000) | (df4_long_2020['NetBEL(inc.TPsasawhole,pre-TMTP)(£m)']>1000)])['Firm'].unique()
fig, ax = plt.subplots(figsize=(30, 20))
sns.heatmap(df4_long.corr(),annot=True,cmap="YlGnBu")
Dropping Netexpenseratio as it is perfectly correlated with Netcombinedratio, and Puregrossclaimsratio as it is highly correlated with Grosscombinedratio.
df4_long_2020_c = df4_long_2020.copy()
df4_long_2020_c = df4_long_2020_c.drop(['Netexpenseratio','Puregrossclaimsratio','Year'],axis=1)
df4_long_2020_c = df4_long_2020_c.set_index('Firm')
df4_long_2020_c.columns
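Reading correlated pairs off a large heatmap is error-prone, so the same decision can be made programmatically. A minimal sketch, with an illustrative `correlated_pairs` helper and synthetic data (the threshold of 0.9 is an assumption, not a value taken from this analysis):

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return (upper.stack()
                 .loc[lambda s: s > threshold]
                 .sort_values(ascending=False))

# Illustrative data: b is a near-copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a * 2 + 0.01 * rng.normal(size=200),
                     'c': rng.normal(size=200)})
print(correlated_pairs(demo))
```

For each pair returned, one member can then be dropped before clustering, exactly as done with Netexpenseratio and Puregrossclaimsratio above.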
Using MinMaxScaler()
scaler = preprocessing.MinMaxScaler()
df4_long_2020_c_scaled = scaler.fit_transform(df4_long_2020_c)
df4_long_2020_c_scaled = pd.DataFrame(df4_long_2020_c_scaled, columns=df4_long_2020_c.columns, index = df4_long_2020_c.index)
df4_long_2020_c_scaled.head()
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 101,
}
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df4_long_2020_c_scaled)
    sse.append(kmeans.inertia_)
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
The graph suggests possible elbow points at 4, 5 and 6 clusters but does not flatten completely; with no clear elbow, the choice of the number of clusters becomes a subjective decision.
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
kl.elbow
# Using the silhouette coefficient
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(df4_long_2020_c_scaled)
    score = silhouette_score(df4_long_2020_c_scaled, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
The best choice for k is 4 as it has a high silhouette score.
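The same choice can be made programmatically as a sanity check. A minimal sketch; the coefficient values below are illustrative placeholders, not the actual silhouette scores from this dataset:

```python
import numpy as np

# Illustrative silhouette coefficients for k = 2..10 (placeholder values)
silhouette_coefficients = [0.61, 0.58, 0.70, 0.55, 0.52, 0.50, 0.49, 0.47, 0.45]

# The loop above started at k = 2, so offset the argmax accordingly
best_k = int(np.argmax(silhouette_coefficients)) + 2
print(best_k)
```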
# Fitting the k-means algorithm to our data
kmeans = KMeans(
    init="random",
    n_clusters=4,
    n_init=10,
    max_iter=300,
    random_state=101,
)
kmeans.fit(df4_long_2020_c_scaled)
kmeans.inertia_ gives the final SSE of the fitted model; the lower the inertia, the tighter the clusters and the better the fit. An inertia of 3.78 indicates a good fit.
kmeans.inertia_
# Final locations of the centroid
kmeans.cluster_centers_
# The number of iterations required to converge
kmeans.n_iter_
kmeans.labels_
Our data is multivariate i.e. it has many dimensions. The best way to plot the clusters would be by reducing it to two dimensions using Principal Component Analysis (PCA) and then plotting it.
df4_long_2020_c_scaled['clusters'] = kmeans.labels_  # labels from the fit above
pca_num_components = 2
# Run PCA on the features only, excluding the cluster labels
reduced_data = PCA(n_components=pca_num_components).fit_transform(df4_long_2020_c_scaled.drop('clusters', axis=1))
results = pd.DataFrame(reduced_data, columns=['pca1', 'pca2'], index=df4_long_2020_c_scaled.index)
results['clusters'] = df4_long_2020_c_scaled['clusters']
results['clusters'].value_counts()
fig = px.scatter(results,x='pca1',y='pca2',color='clusters')
fig.show()
df4_long_2020_c['clusters'] = df4_long_2020_c_scaled['clusters']
km_firms = df4_long_2020_c[df4_long_2020_c['clusters'].isin([1, 2, 3])].index
print(km_firms)
len(km_firms)
model = IsolationForest(n_estimators=100, max_samples=100, contamination=0.10, max_features=5, random_state=101)
df4_long_2020_c_scaled = df4_long_2020_c_scaled.drop('clusters',axis=1)
model.fit(df4_long_2020_c_scaled)
df4_long_2020_c_scaled['scores']=model.decision_function(df4_long_2020_c_scaled)
df4_long_2020_c_scaled['anomaly']=model.predict(df4_long_2020_c_scaled.drop('scores',axis=1))
df4_long_2020_c_scaled['anomaly'].value_counts()
df4_long_2020_c_scaled[df4_long_2020_c_scaled['anomaly']==-1]
if_firms = (df4_long_2020_c_scaled[df4_long_2020_c_scaled['anomaly']==-1]).index
print(if_firms)
print(list(np.intersect1d(if_firms, km_firms)) )
We should note here that the firms obtained as outliers through the unsupervised machine learning algorithms are in line with those flagged in the graphical representations. While this is just a small subsample, the same analysis can be scaled to a large number of firms and the number of clusters in k-means or the contamination parameter in isolation forest can be adjusted accordingly.
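To make that scaling-up concrete, the isolation forest step can be wrapped in a small reusable helper so that the contamination parameter becomes a supervisory dial. A minimal sketch with an illustrative `flag_anomalies` helper and synthetic data (parameter values are assumptions, not tuned for this dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_anomalies(df, contamination=0.10, random_state=101):
    """Fit an isolation forest and return the index labels flagged as anomalies.

    `contamination` is the expected share of outliers; supervisors can tune it
    up or down to widen or narrow the net of firms to investigate."""
    model = IsolationForest(n_estimators=100,
                            contamination=contamination,
                            random_state=random_state)
    labels = model.fit_predict(df)  # -1 = anomaly, 1 = normal
    return df.index[labels == -1]

# Illustrative data: 100 ordinary firms plus one extreme reporter
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 2)),
                    columns=['ratio_a', 'ratio_b'],
                    index=[f'Firm {i}' for i in range(100)])
demo.loc['Firm X'] = [10.0, 10.0]
print(flag_anomalies(demo, contamination=0.05))
```

The same helper can be run over any subset of firms or variables, with the flagged index intersected against the k-means results as done above.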
The Underwriting dataset (data frame 2) has 131 firms that were not present in the general dataset. Hence, the metrics available for these firms are limited. However, I have tried to highlight the firms to be investigated here using a few methods, with the dataset being limited to the year 2020.
A graphical analysis of these variables shows that Firm 449 has an outlying value of -12,033.4 for its Gross combined ratio. Similarly, Firm 418 appears to have anomalous values for the Pure net claims ratio and the Net combined ratio, so supervisors should pay close attention to these firms. A bivariate scatter plot of the Gross Best Estimate Liability vs. the Net Best Estimate Liability for the year 2020 shows that firms whose GrossBEL(inc.TPsaswhole,pre-TMTP)(£m) or NetBEL(inc.TPsasawhole,pre-TMTP)(£m) exceeds 1000 should be focused on. The 15 firms flagged here are Firm 329, Firm 337, Firm 347, Firm 349, Firm 351, Firm 359, Firm 362, Firm 387, Firm 393, Firm 397, Firm 403, Firm 427, Firm 428, Firm 429 and Firm 451.
To apply the machine learning techniques, i.e. k-means clustering and isolation forest, to this dataset, the same considerations for correlated variables and the choice of scaler are used. The final variables in the dataset are GrossBEL(inc.TPsaswhole,pre-TMTP)(£m), Grossclaimsincurred(£m), Grosscombinedratio, Grossexpenseratio, NetBEL(inc.TPsasawhole,pre-TMTP)(£m), Netcombinedratio and Purenetclaimsratio. K-means clustering gives good results, with an inertia of 3.78 for k = 4. Of the 131 firms in the input data, 11 anomalous firms fall into three of the clusters: Firm 337, Firm 349, Firm 359, Firm 362, Firm 387, Firm 418, Firm 427, Firm 428, Firm 429, Firm 449 and Firm 451. We see a major overlap with the firms highlighted through the graphical analyses. An isolation forest model fitted to this data flags 13 firms (out of 131) as anomalies: Firm 337, Firm 347, Firm 359, Firm 362, Firm 364, Firm 387, Firm 403, Firm 418, Firm 427, Firm 428, Firm 449, Firm 451 and Firm 455. Again, we see a major overlap of 9 of the 11 k-means firms.
We should note here that the firms obtained as outliers through the unsupervised machine learning algorithms are in line with those flagged in the graphical analyses. While this is just a small subsample, the same analysis can be scaled to a larger number of firms and the number of clusters in k-means or the contamination parameter in isolation forest can be adjusted accordingly.